Published: January 2, 2025
It's 3 AM. PagerDuty fires. Your on-call engineer wakes up, reads the alert, and remembers there's a runbook somewhere in Confluence. They search for 10 minutes, find an outdated doc from 2022, try the steps, and realize the commands don't work. They escalate to a senior engineer, who fixes it manually.
Mean Time To Resolution (MTTR): 45 minutes. Most of that was searching for runbooks and figuring out what to do.
With automated runbooks: Alert fires → Slack bot suggests remediation → Engineer clicks "Run" → Issue fixed in 3 minutes. MTTR reduced by 70%.
Traditional runbooks are markdown files in Confluence. They rot quickly. Runbooks as code are executable scripts that live in Git, get tested in CI, and run automatically.
#!/bin/bash
# runbooks/restart-api-pod.sh
set -e

# Metadata for automation
# @trigger alert:api_high_memory
# @severity P2
# @approval_required false

echo "Checking API pod health..."

# Get the name of the first running API pod
POD=$(kubectl get pods -n production -l app=api \
  --field-selector=status.phase=Running \
  -o jsonpath='{.items[0].metadata.name}')

echo "Current memory usage:"
kubectl top pod "$POD" -n production

echo "Restarting pod $POD..."
kubectl delete pod "$POD" -n production

echo "Waiting for new pod to be ready..."
kubectl wait --for=condition=ready pod -l app=api -n production --timeout=60s

echo "Pod restarted successfully"
kubectl get pods -n production -l app=api
runbooks/
├── README.md                   # Index of all runbooks
├── common/
│   ├── gather-context.sh       # Fetch logs, metrics, traces
│   └── notify-slack.sh         # Post to incident channel
├── api/
│   ├── high-memory.sh          # Restart pod when OOM
│   ├── rate-limited.sh         # Scale up replicas
│   └── database-connection.sh  # Reset connection pool
├── database/
│   ├── high-cpu.sh             # Analyze slow queries
│   ├── replication-lag.sh      # Force sync replica
│   └── disk-space.sh           # Archive old logs
└── network/
    ├── dns-resolution.sh       # Flush DNS cache
    └── ssl-certificate.sh      # Renew expiring cert

# Each runbook:
# 1. Is executable (chmod +x)
# 2. Has metadata comments (@trigger, @severity)
# 3. Outputs human-readable status
# 4. Returns exit code 0 on success
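The rules above are simple enough to check in CI. The following is a minimal sketch, assuming the `# @key value` comment format shown in the sample runbook. The names `parse_metadata` and `check_runbook` and the set of required keys are illustrative and not part of any existing tool.

```python
import os
import re
import stat

# Matches metadata comments such as "# @trigger alert:api_high_memory"
META_LINE = re.compile(r"^#\s*@(\w+)\s+(.+?)\s*$")
REQUIRED_KEYS = {"trigger", "severity"}  # assumption: mirrors rule 2 above


def parse_metadata(text):
    """Collect @key value pairs from comment lines in a runbook script."""
    meta = {}
    for line in text.splitlines():
        match = META_LINE.match(line)
        if match:
            meta[match.group(1)] = match.group(2)
    return meta


def check_runbook(path):
    """Return a list of problems; an empty list means the runbook passes."""
    problems = []
    # Rule 1: the owner must be able to execute the script
    if not os.stat(path).st_mode & stat.S_IXUSR:
        problems.append("not executable")
    # Rule 2: the required metadata comments must be present
    with open(path) as handle:
        meta = parse_metadata(handle.read())
    for key in sorted(REQUIRED_KEYS - meta.keys()):
        problems.append(f"missing @{key}")
    return problems
```

A CI job could call `check_runbook` on every `.sh` file under `runbooks/` and fail the build if any list comes back non-empty.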
ChatOps means running operations from chat. When an alert fires, a bot posts suggested runbooks to Slack. An engineer clicks a button, and the bot runs the script and posts the output.
# bot.py: Slack bot using the Bolt framework (Socket Mode)
import os
import re
import subprocess

from slack_bolt import App
from slack_bolt.adapter.socket_mode import SocketModeHandler

app = App(token=os.environ["SLACK_BOT_TOKEN"])

# Accept only names like "api/high-memory" so input cannot escape runbooks/
VALID_NAME = re.compile(r"^[a-z0-9-]+(/[a-z0-9-]+)*$")


def find_runbook(name):
    """Return the script path for a runbook name, or None if invalid or missing."""
    if not VALID_NAME.match(name):
        return None
    path = f"runbooks/{name}.sh"
    return path if os.path.isfile(path) else None


@app.command("/runbook")
def handle_runbook(ack, command, client):
    ack()
    runbook_name = command["text"].strip()

    if find_runbook(runbook_name) is None:
        client.chat_postMessage(
            channel=command["channel_id"],
            text=f"Runbook '{runbook_name}' not found",
        )
        return

    # Post confirmation with Run / Cancel buttons
    client.chat_postMessage(
        channel=command["channel_id"],
        text=f"Ready to run runbook: `{runbook_name}`",
        blocks=[
            {
                "type": "section",
                "text": {"type": "mrkdwn", "text": f"Run `{runbook_name}`?"},
            },
            {
                "type": "actions",
                "elements": [
                    {
                        "type": "button",
                        "text": {"type": "plain_text", "text": "Run"},
                        "style": "primary",
                        "action_id": f"run_runbook:{runbook_name}",
                    },
                    {
                        "type": "button",
                        "text": {"type": "plain_text", "text": "Cancel"},
                        "action_id": "cancel",
                    },
                ],
            },
        ],
    )


# Bolt matches action IDs by exact string or regex, not by glob patterns
@app.action(re.compile(r"^run_runbook:"))
def handle_run_runbook(ack, action, client, body):
    ack()
    runbook_name = action["action_id"].split(":", 1)[1]
    channel = body["channel"]["id"]
    user = body["user"]["name"]

    path = find_runbook(runbook_name)
    if path is None:
        client.chat_postMessage(channel=channel, text=f"Runbook '{runbook_name}' not found")
        return

    client.chat_postMessage(
        channel=channel,
        text=f"Running `{runbook_name}` (started by @{user})...",
    )

    try:
        result = subprocess.run([path], capture_output=True, text=True, timeout=300)
        # A non-zero exit code means the runbook itself reported failure
        status = "completed" if result.returncode == 0 else f"failed (exit code {result.returncode})"
        client.chat_postMessage(
            channel=channel,
            text=f"Runbook {status}\n```\n{result.stdout}\n```",
        )
    except Exception as e:
        client.chat_postMessage(channel=channel, text=f"Runbook failed: {e}")


@app.action("cancel")
def handle_cancel(ack):
    ack()  # Acknowledge so Slack does not show an error on the button


if __name__ == "__main__":
    handler = SocketModeHandler(app, os.environ["SLACK_APP_TOKEN"])
    handler.start()
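A runbook can exit non-zero or print far more output than is pleasant to read in a channel. The sketch below shows one way to turn a finished process into a chat message. The helper name `format_result` and the 3000-character cap are assumptions chosen for readability, not Slack or Bolt requirements.

```python
import subprocess

FENCE = "`" * 3  # Slack renders text between triple backticks as a code block


def format_result(name, result, max_chars=3000):
    """Summarize a finished runbook process, keeping only the tail of long output."""
    if result.returncode == 0:
        status = "completed"
    else:
        status = f"failed (exit code {result.returncode})"
    output = ((result.stdout or "") + (result.stderr or "")).strip()
    if len(output) > max_chars:
        output = "[output truncated]\n" + output[-max_chars:]
    return f"Runbook `{name}` {status}\n{FENCE}\n{output}\n{FENCE}"


# Example with a hand-built result; in the bot this would come from subprocess.run
demo = subprocess.CompletedProcess(
    args=["runbooks/api/high-memory.sh"],
    returncode=1,
    stdout="Restarting pod...\n",
    stderr="error: timed out waiting for the condition\n",
)
print(format_result("api/high-memory", demo))
```

Keeping the tail rather than the head is a deliberate choice here: the last lines of a failed script usually contain the error.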
When PagerDuty fires an alert, automatically post to Slack with relevant runbooks based on alert labels.
# pagerduty-webhook-handler.py
import os

import requests
from flask import Flask, request

app = Flask(__name__)

RUNBOOK_MAP = {
    "high_memory": "api/high-memory",
    "high_cpu": "database/high-cpu",
    "disk_full": "database/disk-space",
    "ssl_expiring": "network/ssl-certificate",
}


@app.route("/webhook", methods=["POST"])
def handle_pagerduty_webhook():
    # This handler assumes a payload with top-level "event" and "incident" keys;
    # adjust the lookups to the PagerDuty webhook version you use.
    event = request.get_json(silent=True) or {}

    if event.get("event") == "incident.triggered":
        incident = event["incident"]
        alert_key = incident.get("alert_key", "")

        # Find the first runbook whose key appears in the alert key
        runbook = None
        for key, rb in RUNBOOK_MAP.items():
            if key in alert_key:
                runbook = rb
                break

        message = (
            f"*Incident: {incident['title']}*\n"
            f"Severity: {incident['urgency']}\n"
            f"Service: {incident['service']['name']}\n\n"
            "Context:\n"
            f"• Incident URL: {incident['html_url']}\n"
            f"• Triggered: {incident['created_at']}\n"
        )

        if runbook:
            message += (
                f"\n*Suggested Runbook:* `{runbook}`\n"
                f"Run with: `/runbook {runbook}`\n"
            )

        # Post to Slack
        requests.post(
            "https://slack.com/api/chat.postMessage",
            headers={"Authorization": f"Bearer {os.environ['SLACK_BOT_TOKEN']}"},
            json={"channel": "#incidents", "text": message},
            timeout=10,
        )

    return {"status": "ok"}


if __name__ == "__main__":
    app.run(port=5000)
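The lookup above takes the first key found in the alert key, so the result depends on dictionary order when keys overlap (for example a short key such as `cpu` next to `high_cpu`). One possible refinement, sketched here under the assumption that the more specific key should win, is to prefer the longest match. The function name `suggest_runbook` is hypothetical.

```python
RUNBOOK_MAP = {
    "high_memory": "api/high-memory",
    "high_cpu": "database/high-cpu",
    "disk_full": "database/disk-space",
    "ssl_expiring": "network/ssl-certificate",
}


def suggest_runbook(alert_key, runbook_map=RUNBOOK_MAP):
    """Return the runbook whose key is the longest substring of alert_key, or None."""
    matches = [key for key in runbook_map if key in alert_key]
    if not matches:
        return None
    return runbook_map[max(matches, key=len)]


print(suggest_runbook("prod-api-high_memory-01"))
print(suggest_runbook("something-unmapped"))
```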
For common, low-risk issues, don't wait for human intervention. Run runbooks automatically.
# Alertmanager config with auto-remediation
# alertmanager.yml
route:
  receiver: 'slack'
  routes:
    - match:
        severity: P3
        auto_remediate: 'true'
      receiver: 'auto-remediate'
      continue: true  # Keep evaluating sibling routes
    - receiver: 'slack'  # Catch-all sibling so remediated alerts also reach Slack

receivers:
  - name: 'auto-remediate'
    webhook_configs:
      - url: 'http://runbook-executor:8080/auto-remediate'
        send_resolved: false
  - name: 'slack'
    slack_configs:
      - api_url: '<webhook_url>'
        channel: '#alerts'

# runbook-executor service
# auto-remediate.py
import logging
import os
import subprocess

import requests
from flask import Flask, request

app = Flask(__name__)
logging.basicConfig(level=logging.INFO)

AUTO_REMEDIATE_MAP = {
    "api_high_memory": "api/high-memory.sh",
    "disk_space_warning": "database/disk-space.sh",
    "pod_crashloop": "common/restart-pod.sh",
}


def post_to_slack(text):
    # Assumption: SLACK_WEBHOOK_URL holds a Slack incoming-webhook URL
    requests.post(os.environ["SLACK_WEBHOOK_URL"], json={"text": text}, timeout=10)


@app.route("/auto-remediate", methods=["POST"])
def auto_remediate():
    alert = request.json["alerts"][0]
    alert_name = alert["labels"]["alertname"]

    if alert_name in AUTO_REMEDIATE_MAP:
        runbook = AUTO_REMEDIATE_MAP[alert_name]
        logging.info(f"Auto-remediating {alert_name} with {runbook}")

        try:
            result = subprocess.run(
                [f"runbooks/{runbook}"],
                capture_output=True,
                text=True,
                timeout=300,
            )

            # Post result to Slack
            post_to_slack(
                f"Auto-remediation for {alert_name}\n"
                f"Status: {'Success' if result.returncode == 0 else 'Failed'}\n"
                f"```{result.stdout}```"
            )
            return {"status": "executed", "exit_code": result.returncode}

        except Exception as e:
            logging.error(f"Auto-remediation failed: {e}")
            post_to_slack(f"Auto-remediation failed: {e}")
            return {"status": "error", "message": str(e)}, 500

    return {"status": "no_runbook_found"}, 404


if __name__ == "__main__":
    # Port matches the webhook URL in alertmanager.yml
    app.run(host="0.0.0.0", port=8080)
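The handler above reads only the first alert in the payload, but Alertmanager groups alerts and can deliver several in one webhook call, including resolved ones if `send_resolved` is enabled. A minimal sketch of selecting what to run from the whole payload follows. The function name `plan_remediations` is hypothetical, and running each runbook at most once per notification is a design assumption.

```python
AUTO_REMEDIATE_MAP = {
    "api_high_memory": "api/high-memory.sh",
    "disk_space_warning": "database/disk-space.sh",
    "pod_crashloop": "common/restart-pod.sh",
}


def plan_remediations(payload, runbook_map=AUTO_REMEDIATE_MAP):
    """Return (alertname, runbook) pairs for firing alerts that have a mapped runbook."""
    plan = []
    seen = set()
    for alert in payload.get("alerts", []):
        name = alert.get("labels", {}).get("alertname")
        # Skip resolved alerts and alerts without a mapped runbook
        if alert.get("status") != "firing" or name not in runbook_map:
            continue
        # Run each runbook at most once per notification
        if name in seen:
            continue
        seen.add(name)
        plan.append((name, runbook_map[name]))
    return plan


sample = {
    "status": "firing",
    "alerts": [
        {"status": "firing", "labels": {"alertname": "api_high_memory"}},
        {"status": "firing", "labels": {"alertname": "api_high_memory"}},
        {"status": "resolved", "labels": {"alertname": "pod_crashloop"}},
        {"status": "firing", "labels": {"alertname": "unmapped_alert"}},
    ],
}
print(plan_remediations(sample))
```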
When an incident fires, automatically gather context: recent logs, metrics, traces, config changes.
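#!/bin/bash
# runbooks/common/gather-context.sh
# Usage: gather-context.sh <service> <namespace>
SERVICE=$1
NAMESPACE=$2
TIME_WINDOW="15m"

if [ -z "$SERVICE" ] || [ -z "$NAMESPACE" ]; then
  echo "Usage: $0 <service> <namespace>" >&2
  exit 1
fi

echo "Gathering context for $SERVICE in $NAMESPACE..."

# Recent logs (errors only). printf is used because plain echo does not expand \n in bash.
printf '\nRecent ERROR logs:\n'
kubectl logs -n "$NAMESPACE" -l app="$SERVICE" --tail=50 --since="$TIME_WINDOW" | grep ERROR

# Pod status
printf '\nPod status:\n'
kubectl get pods -n "$NAMESPACE" -l app="$SERVICE"

# Resource usage
printf '\nResource usage:\n'
kubectl top pods -n "$NAMESPACE" -l app="$SERVICE"

# Recent deployments
printf '\nRecent deployments:\n'
kubectl rollout history deployment/"$SERVICE" -n "$NAMESPACE" | tail -5

# Prometheus query: 5xx request rate. -G with --data-urlencode keeps curl
# from treating the braces and brackets in the query as URL globs.
printf '\nError rate (last %s):\n' "$TIME_WINDOW"
curl -sG "http://prometheus:9090/api/v1/query" \
  --data-urlencode "query=rate(http_requests_total{service='$SERVICE',status=~'5..'}[$TIME_WINDOW])" \
  | jq -r '.data.result[0].value[1]'

# Recent slow trace (if Tempo is available); Tempo's search API filters by tags
printf '\nRecent slow trace:\n'
curl -sG "http://tempo:3100/api/search" \
  --data-urlencode "tags=service.name=$SERVICE" \
  --data-urlencode "minDuration=1s" \
  --data-urlencode "limit=1" \
  | jq -r '.traces[0].traceID'

printf '\nContext gathering complete\n'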
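The `jq` filter in the script above prints `null` when the query returns no series, which is easy to misread during an incident. If the context is gathered from Python instead, the same case can be handled explicitly. This is a small sketch that parses the standard instant-query response body. The function name `first_sample_value` is hypothetical.

```python
import json


def first_sample_value(response_text):
    """Return the first sample of an instant-query response as a float, or None if empty."""
    body = json.loads(response_text)
    if body.get("status") != "success":
        return None
    results = body.get("data", {}).get("result", [])
    if not results:
        return None
    # Each vector sample is [unix_timestamp, "value-as-string"]
    return float(results[0]["value"][1])


sample = json.dumps({
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [{"metric": {"service": "api"}, "value": [1735800000, "0.25"]}],
    },
})
rate = first_sample_value(sample)
print("no data" if rate is None else f"{rate:.2f} errors/s")
```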
Generate postmortem templates automatically from Slack chat logs and timeline.
# generate-postmortem.py
import os
from datetime import datetime

from slack_sdk import WebClient

client = WebClient(token=os.environ["SLACK_BOT_TOKEN"])


def generate_postmortem(incident_channel, start_time, resolution_time):
    """Build a postmortem draft; start_time and resolution_time are Unix timestamps in seconds."""
    # Fetch messages from the incident channel (first page only; Slack returns newest first)
    messages = client.conversations_history(
        channel=incident_channel,
        oldest=str(start_time),
        latest=str(resolution_time),
    )

    timeline = []
    actions_taken = []

    for msg in reversed(messages["messages"]):  # oldest first
        timestamp = datetime.fromtimestamp(float(msg["ts"]))
        text = msg.get("text", "")

        # Extract timeline events
        if "alert" in text.lower() or "incident" in text.lower():
            timeline.append(f"- **{timestamp.strftime('%H:%M:%S')}**: {text}")

        # Extract actions
        if "/runbook" in text or "ran" in text.lower():
            actions_taken.append(f"- {text}")

    # One list item per line
    timeline_md = "\n".join(timeline)
    actions_md = "\n".join(actions_taken)

    # Generate markdown postmortem
    postmortem = f"""# Postmortem: {incident_channel}

**Date**: {datetime.now().strftime('%Y-%m-%d')}
**Duration**: {(resolution_time - start_time) / 60:.0f} minutes
**Severity**: P1
**Status**: Resolved

## Summary
[Brief description of what happened]

## Timeline
{timeline_md}

## Root Cause
[To be filled in by incident commander]

## Actions Taken
{actions_md}

## What Went Well
- ChatOps commands executed successfully
- Auto-remediation contained the issue
- Context gathering provided immediate diagnostics

## What Went Wrong
[To be filled in]

## Action Items
- [ ] [Action item 1]
- [ ] [Action item 2]

## Lessons Learned
[To be filled in]
"""

    # Save locally; commit the file to Git to publish it
    os.makedirs("postmortems", exist_ok=True)
    with open(f"postmortems/{incident_channel}.md", "w") as f:
        f.write(postmortem)

    return postmortem
┌────────────────────────┬─────────────┬───────────────────┬─────────────┐
│ Phase                  │ Before      │ After             │ Improvement │
├────────────────────────┼─────────────┼───────────────────┼─────────────┤
│ Alert → On-call wakes  │ 2 min       │ 2 min             │ -           │
│ Find runbook           │ 10 min      │ 0 min (auto-post) │ -10 min     │
│ Understand issue       │ 15 min      │ 2 min (auto-ctx)  │ -13 min     │
│ Execute fix            │ 15 min      │ 3 min (ChatOps)   │ -12 min     │
│ Verify resolution      │ 5 min       │ 2 min             │ -3 min      │
├────────────────────────┼─────────────┼───────────────────┼─────────────┤
│ **Total MTTR**         │ **47 min**  │ **9 min**         │ **-81%**    │
└────────────────────────┴─────────────┴───────────────────┴─────────────┘

With auto-remediation (P3 issues):
- Alert → Auto-fix → Resolution: 3 minutes
- No human intervention required
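The totals in the table can be reproduced from the per-phase figures. This short check uses only the numbers listed above:

```python
# (phase, minutes before, minutes after), taken from the table above
PHASES = [
    ("Alert -> on-call wakes", 2, 2),
    ("Find runbook", 10, 0),
    ("Understand issue", 15, 2),
    ("Execute fix", 15, 3),
    ("Verify resolution", 5, 2),
]

before = sum(b for _, b, _ in PHASES)
after = sum(a for _, _, a in PHASES)
reduction = (before - after) / before * 100

# prints: Before: 47 min, after: 9 min, reduction: 81%
print(f"Before: {before} min, after: {after} min, reduction: {reduction:.0f}%")
```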
Start this week: Convert your top 3 runbooks to executable scripts. Set up a Slack bot. Auto-remediate one P3 alert. Watch your MTTR drop.
Copyright © 2025 HostingX IL. All Rights Reserved.