# 🚨 Skill: Alerting & Incident Management
## 📋 Metadata
| Attribute | Value |
|---|---|
| ID | sre-alerting-incident-management |
| Level | 🔴 Advanced |
| Version | 1.0.0 |
| Keywords | alerting, incident-management, pagerduty, opsgenie, oncall, runbooks |
| Reference | Google SRE - On-Call, PagerDuty |
## 🔑 Invocation Keywords
`alerting`, `incident-management`, `pagerduty`, `opsgenie`, `oncall`, `runbooks`, `incident-response`, `@skill:alerting`
### Example Prompts
- "Implement alerting with PagerDuty and runbooks"
- "Configure incident management and on-call rotation"
- "Configure runbooks and on-call rotation"
- "@skill:alerting - Complete alerting and incident management system"
## 📖 Description
Effective alerting and incident management are critical for keeping services reliable. This skill covers designing effective alerts, on-call rotations, runbooks, and incident response workflows.
## ✅ When to Use This Skill
- Production services
- On-call teams
- Critical SLAs
- Distributed systems
- Frequent incidents
- Compliance requirements
## ❌ When NOT to Use This Skill
- Local development only
- Non-critical services without an SLA
- Teams without on-call capacity
## 🏗️ Alerting Architecture
```
┌──────────────┐
│  Prometheus  │
│  (Metrics)   │
└──────┬───────┘
       │
       │ Alert Rules
       │
┌──────▼───────┐
│ Alertmanager │
└──────┬───────┘
       │
       ├─────────┬──────────┬─────────┐
       │         │          │         │
┌──────▼───┐ ┌───▼────┐ ┌───▼────┐ ┌──▼─────┐
│ PagerDuty│ │  Slack │ │  Email │ │ Webhook│
└──────┬───┘ └────────┘ └────────┘ └────────┘
       │
       │
┌──────▼──────────────┐
│  On-Call Engineer   │
│ (Incident Response) │
└─────────────────────┘
```
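Prometheus pushes firing alerts to Alertmanager through the `alerting:` block of `prometheus.yml`, and Alertmanager fans them out to the receivers shown above. The wiring can be smoke-tested from code; the following is a minimal sketch, assuming the `requests` library and a Prometheus instance at `http://prometheus:9090` (a placeholder hostname, matching the config later in this skill):

```python
# Hypothetical smoke test: verify Prometheus has discovered an Alertmanager
# instance by asking its /api/v1/alertmanagers endpoint.
import requests

PROMETHEUS_URL = "http://prometheus:9090"  # placeholder, adjust to your setup


def check_alertmanager_wiring() -> bool:
    """Return True if Prometheus reports at least one active Alertmanager."""
    resp = requests.get(f"{PROMETHEUS_URL}/api/v1/alertmanagers", timeout=5)
    resp.raise_for_status()
    active = resp.json()["data"]["activeAlertmanagers"]
    for am in active:
        print(f"Active Alertmanager: {am['url']}")
    return len(active) > 0


if __name__ == "__main__":
    if not check_alertmanager_wiring():
        raise SystemExit("Prometheus is not connected to any Alertmanager")
```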
## 💻 Implementation
### 1. Alertmanager Configuration
```yaml
# alertmanager/alertmanager.yml
global:
  resolve_timeout: 5m
  slack_api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'

route:
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 12h
  receiver: 'default'
  routes:
    # Critical alerts → PagerDuty immediately
    - match:
        severity: critical
      receiver: 'pagerduty-critical'
      continue: true

    # Warning alerts → Slack only
    - match:
        severity: warning
      receiver: 'slack-warnings'
      repeat_interval: 24h

    # Service-specific routes
    - match:
        service: payment-service
      receiver: 'payment-team'
      group_wait: 5s

inhibit_rules:
  # Inhibit warning if critical is firing
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'cluster', 'service']

receivers:
  - name: 'default'
    slack_configs:
      - channel: '#alerts'
        title: '{{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'

  - name: 'pagerduty-critical'
    pagerduty_configs:
      - service_key: 'YOUR_PAGERDUTY_SERVICE_KEY'
        description: '{{ .GroupLabels.alertname }}: {{ .GroupLabels.service }}'
        severity: 'critical'
        client: 'Prometheus'
        client_url: 'http://prometheus:9090'

  - name: 'slack-warnings'
    slack_configs:
      - channel: '#alerts-warnings'
        title: 'Warning: {{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'

  - name: 'payment-team'
    pagerduty_configs:
      - service_key: 'PAYMENT_TEAM_KEY'
        description: 'Payment Service Alert'
    slack_configs:
      - channel: '#payment-team'
```
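To verify that routing and inhibition behave as intended, it helps to look at what Alertmanager is actually holding. A minimal sketch, assuming Alertmanager's v2 HTTP API is reachable at `http://alertmanager:9093` (a placeholder hostname):

```python
# Hypothetical helper: list alerts currently known to Alertmanager, grouped
# by severity, via its v2 API. Requires the `requests` library.
from collections import defaultdict

import requests

ALERTMANAGER_URL = "http://alertmanager:9093"  # placeholder, adjust to your setup


def summarize_active_alerts() -> dict:
    """Group currently known alerts by their severity label."""
    resp = requests.get(f"{ALERTMANAGER_URL}/api/v2/alerts", timeout=5)
    resp.raise_for_status()
    by_severity = defaultdict(list)
    for alert in resp.json():
        labels = alert.get("labels", {})
        by_severity[labels.get("severity", "none")].append(
            labels.get("alertname", "unknown")
        )
    return dict(by_severity)


if __name__ == "__main__":
    for severity, names in summarize_active_alerts().items():
        print(f"{severity}: {len(names)} firing -> {sorted(set(names))}")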
### 2. Alert Rules (Prometheus)
```yaml
# prometheus/alerts/service-alerts.yml
groups:
  - name: service_alerts
    interval: 30s
    rules:
      # Error Budget Exhaustion
      - alert: ErrorBudgetExhaustion
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
            /
            sum(rate(http_requests_total[5m])) by (service)
          ) > 0.001
          and
          (
            sum_over_time(
              (
                sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
                /
                sum(rate(http_requests_total[5m])) by (service)
              )[28d:]
            ) > 0.001
          )
        for: 5m
        labels:
          severity: critical
          team: backend
        annotations:
          summary: "Error budget exhausted for {{ $labels.service }}"
          description: |
            Service {{ $labels.service }} has exceeded error budget threshold.
            Current error rate: {{ $value | humanizePercentage }}
            Runbook: https://runbooks.example.com/error-budget

      # High Latency
      - alert: HighLatencyP99
        expr: |
          histogram_quantile(0.99,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
          ) > 1.0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High P99 latency in {{ $labels.service }}"
          description: "P99 latency is {{ $value }}s (threshold: 1s)"
          runbook: https://runbooks.example.com/high-latency

      # Service Down
      - alert: ServiceDown
        expr: up{job=~"app-.*"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Service {{ $labels.job }} is down"
          description: "Service has been unreachable for more than 2 minutes"
          runbook: https://runbooks.example.com/service-down

      # High Memory Usage
      - alert: HighMemoryUsage
        expr: |
          (
            container_memory_usage_bytes{pod=~".+"}
            /
            container_spec_memory_limit_bytes{pod=~".+"}
          ) > 0.9
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage in {{ $labels.pod }}"
          description: "Memory usage is {{ $value | humanizePercentage }}"
          runbook: https://runbooks.example.com/high-memory

      # Disk Space
      - alert: DiskSpaceLow
        expr: |
          (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) < 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Low disk space on {{ $labels.instance }}"
          description: "Only {{ $value | humanizePercentage }} disk space remaining"

      # CPU Throttling
      - alert: CPUThrottling
        expr: |
          rate(container_cpu_cfs_throttled_seconds_total[5m]) > 0.1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "CPU throttling detected in {{ $labels.pod }}"
          description: "Container is being CPU throttled"

      # Connection Pool Exhaustion
      - alert: ConnectionPoolExhaustion
        expr: |
          (
            db_connections_active{service=~".+"}
            /
            db_connections_max{service=~".+"}
          ) > 0.9
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Connection pool near exhaustion in {{ $labels.service }}"
          description: "{{ $value | humanizePercentage }} of connections in use"
```
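Before committing a rule, it is worth running its expression as an ad-hoc query against Prometheus to confirm it returns the series you expect. A minimal sketch using the standard `/api/v1/query` endpoint (the URL is a placeholder; the expression is the `HighLatencyP99` rule from above):

```python
# Hypothetical rule preview: evaluate an alert expression as an instant query
# and print any series that would currently fire. Requires `requests`.
import requests

PROMETHEUS_URL = "http://prometheus:9090"  # placeholder

HIGH_LATENCY_EXPR = (
    "histogram_quantile(0.99, "
    "sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)) > 1.0"
)


def preview_alert(expr: str) -> list:
    """Return the instant-query result vector for an alert expression."""
    resp = requests.get(
        f"{PROMETHEUS_URL}/api/v1/query", params={"query": expr}, timeout=10
    )
    resp.raise_for_status()
    return resp.json()["data"]["result"]


if __name__ == "__main__":
    for series in preview_alert(HIGH_LATENCY_EXPR):
        service = series["metric"].get("service", "unknown")
        value = series["value"][1]  # result is [timestamp, value]
        print(f"Would fire for {service}: p99 = {value}s")
```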
### 3. Runbook Template
# Runbook: High Error Rate
## Alert Name
`HighErrorRate`
## Severity
Critical
## Description
Service error rate exceeds threshold (5% for 5 minutes)
## Symptoms
- High HTTP 5xx response rate
- User complaints
- Error logs increasing
## Immediate Actions
1. **Acknowledge Alert**
   - Acknowledge in PagerDuty/OpsGenie
   - Notify team in Slack

2. **Check Service Health**
   ```bash
   kubectl get pods -l app=service-name
   kubectl logs -l app=service-name --tail=100
   ```

3. **Check Metrics**
   - Grafana dashboard: `/d/service-overview`
   - Error rate graph
   - Latency graphs

4. **Identify Root Cause**
   - Check recent deployments
   - Review error logs
   - Check dependencies (DB, APIs)
## Resolution Steps

**If caused by code deployment:**
```bash
# Rollback deployment
kubectl rollout undo deployment/service-name
```

**If caused by resource exhaustion:**
```bash
# Scale up
kubectl scale deployment/service-name --replicas=5
```

**If caused by dependency failure:**
- Check dependency service status
- Implement circuit breaker (see the sketch after this runbook)
- Enable fallback mechanisms
## Post-Incident
- Document in incident log
- Update runbook if needed
## Escalation
- If not resolved in 15 min → Escalate to senior engineer
- If not resolved in 30 min → Escalate to engineering manager
- If service completely down → Escalate to CTO
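The "implement circuit breaker" step in the resolution list is easier if the pattern is prepared before an incident. The sketch below shows the general idea only; the failure threshold and recovery timeout are illustrative assumptions, not values taken from this skill:

```python
# Minimal circuit breaker sketch: stop calling a failing dependency for a
# cool-down period instead of letting errors cascade. Illustrative only.
import time


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, recovery_timeout: float = 30.0):
        self.failure_threshold = failure_threshold  # failures before opening
        self.recovery_timeout = recovery_timeout    # seconds to stay open
        self.failures = 0
        self.opened_at = None

    def call(self, func, *args, **kwargs):
        # Open state: fail fast until the recovery timeout has elapsed.
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.recovery_timeout:
                raise RuntimeError("circuit open: dependency call skipped")
            self.opened_at = None  # half-open: allow one trial call
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0  # a success closes the circuit again
        return result
```

Callers wrap dependency calls, e.g. `breaker.call(fetch_payment_status, order_id)` (a hypothetical function), and the fallback mechanisms from the runbook handle the fast failures while the circuit is open.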
### 4. On-Call Rotation (PagerDuty)
```yaml
# pagerduty/escalation-policies.yml
escalation_policies:
  - name: "Primary On-Call"
    description: "Primary escalation for production services"
    num_loops: 3
    escalation_rules:
      - escalation_delay_in_minutes: 0
        targets:
          - type: "user"
            id: "PXXXXXXXX"  # Primary on-call
      - escalation_delay_in_minutes: 15
        targets:
          - type: "user"
            id: "PYYYYYYYY"  # Secondary on-call
      - escalation_delay_in_minutes: 30
        targets:
          - type: "schedule"
            id: "PZZZZZZZZ"  # Manager on-call

  - name: "Critical Alerts Only"
    escalation_rules:
      - escalation_delay_in_minutes: 0
        targets:
          - type: "user_reference"
            id: "PXXXXXXXX"
      - escalation_delay_in_minutes: 10
        targets:
          - type: "escalation_policy_reference"
            id: "EPXXXXXXX"  # Manager escalation

schedules:
  - name: "Primary On-Call Schedule"
    time_zone: "America/New_York"
    layers:
      - name: "Layer 1"
        start: "2024-01-01T00:00:00"
        rotation_virtual_start: "2024-01-01T00:00:00"
        rotation_turn_length_seconds: 604800  # 1 week
        users:
          - user_id: "PXXXXXXXX"
          - user_id: "PYYYYYYYY"
          - user_id: "PZZZZZZZZ"
        restrictions:
          - type: "daily_restriction"
            start_time_of_day: "09:00:00"
            duration_seconds: 32400  # 9 hours
```
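Alertmanager normally opens PagerDuty incidents through the `pagerduty_configs` receiver shown earlier, but incidents can also be triggered programmatically, for example from a deploy pipeline or a chat command. A minimal sketch against the PagerDuty Events API v2 (the routing key is a placeholder for your integration key):

```python
# Hypothetical manual trigger: open a PagerDuty incident via the Events API v2.
# Requires `requests` and an Events v2 integration routing key.
import requests

PAGERDUTY_EVENTS_URL = "https://events.pagerduty.com/v2/enqueue"
ROUTING_KEY = "YOUR_PAGERDUTY_ROUTING_KEY"  # placeholder


def trigger_incident(summary: str, source: str, severity: str = "critical") -> str:
    """Trigger an incident and return the dedup_key for later ack/resolve."""
    payload = {
        "routing_key": ROUTING_KEY,
        "event_action": "trigger",
        "payload": {
            "summary": summary,    # human-readable description
            "source": source,      # affected system, e.g. hostname or service
            "severity": severity,  # critical, error, warning, or info
        },
    }
    resp = requests.post(PAGERDUTY_EVENTS_URL, json=payload, timeout=10)
    resp.raise_for_status()
    return resp.json()["dedup_key"]


if __name__ == "__main__":
    key = trigger_incident("payment-service error rate above SLO", "payment-service")
    print(f"Incident triggered, dedup_key={key}")
```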
### 5. Incident Response Workflow
```python
# incident_response/workflow.py
from dataclasses import dataclass
from datetime import datetime
from enum import Enum
from typing import List, Optional


class IncidentSeverity(Enum):
    SEV1 = "critical"  # Service down
    SEV2 = "high"      # Major degradation
    SEV3 = "medium"    # Minor issues
    SEV4 = "low"       # Informational


@dataclass
class Incident:
    id: str
    title: str
    severity: IncidentSeverity
    status: str  # open, investigating, mitigated, resolved
    created_at: datetime
    resolved_at: Optional[datetime]
    assigned_to: str
    affected_services: List[str]
    description: str
    root_cause: Optional[str] = None
    resolution: Optional[str] = None


class IncidentResponseWorkflow:
    def __init__(self):
        self.incidents = []

    def create_incident(
        self,
        title: str,
        severity: IncidentSeverity,
        affected_services: List[str],
        description: str
    ) -> Incident:
        incident = Incident(
            id=f"INC-{datetime.now().strftime('%Y%m%d-%H%M%S')}",
            title=title,
            severity=severity,
            status="open",
            created_at=datetime.now(),
            resolved_at=None,
            assigned_to=self._assign_oncall(),
            affected_services=affected_services,
            description=description
        )
        self.incidents.append(incident)
        self._notify_team(incident)
        self._create_incident_channel(incident)
        return incident

    def _assign_oncall(self) -> str:
        # Logic to assign to current on-call engineer
        return "engineer@example.com"

    def _notify_team(self, incident: Incident):
        # Send notifications via PagerDuty, Slack, etc.
        pass

    def _create_incident_channel(self, incident: Incident):
        # Create dedicated Slack channel for incident
        channel_name = f"incident-{incident.id.lower()}"
        # Create channel logic
        pass

    def update_status(self, incident_id: str, status: str, notes: str):
        incident = self._find_incident(incident_id)
        if incident:
            incident.status = status
            if status == "resolved":
                incident.resolved_at = datetime.now()

    def _find_incident(self, incident_id: str) -> Optional[Incident]:
        return next((i for i in self.incidents if i.id == incident_id), None)
```
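A short usage sketch of the workflow class above; the service names, titles, and notes are illustrative values, not data from this skill:

```python
# Example usage of IncidentResponseWorkflow (illustrative values only).
workflow = IncidentResponseWorkflow()

incident = workflow.create_incident(
    title="Payment API returning 5xx above SLO",
    severity=IncidentSeverity.SEV1,
    affected_services=["payment-service", "checkout"],
    description="Error rate 4.2% over the last 10 minutes, SLO is 0.1%",
)
print(incident.id, incident.assigned_to)

# Later, once mitigation has been rolled out:
workflow.update_status(incident.id, "resolved", notes="Rolled back the last deployment")
```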
### 6. Alert Fatigue Prevention
```yaml
# alertmanager/alert-fatigue-prevention.yml
# Strategies to prevent alert fatigue

# 1. Alert Grouping
route:
  group_by: ['alertname', 'service']
  group_wait: 10s       # Wait before sending initial notification
  group_interval: 5m    # Wait before sending updated notification
  repeat_interval: 12h  # Minimum time between notifications

# 2. Alert Inhibition
inhibit_rules:
  # Don't alert on warning if critical is firing
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['service']

  # Don't alert on individual instance if all instances are down
  - source_match:
      alertname: 'AllInstancesDown'
    target_match_re:
      alertname: '.*InstanceDown'

# 3. Alert Suppression (Silence Rules)
# Use Alertmanager UI or API to create silences for known issues

# 4. Threshold Tuning
# Use error budgets and SLOs to set meaningful thresholds
# Example: Alert only when error budget is at risk

# 5. Alert Classification
# Only page on actionable alerts
# Use different channels for different severities
```
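Point 3 above mentions silences; they can be created from the Alertmanager UI, with `amtool`, or through the v2 API. A minimal sketch of the API route, assuming Alertmanager at `http://alertmanager:9093` (placeholder) and a known-noisy alert name:

```python
# Hypothetical silence creation: mute one alert for a maintenance window
# via Alertmanager's v2 API. Requires the `requests` library.
from datetime import datetime, timedelta, timezone

import requests

ALERTMANAGER_URL = "http://alertmanager:9093"  # placeholder


def silence_alert(alertname: str, hours: int, comment: str, created_by: str) -> str:
    """Create a silence matching one alertname and return its silence ID."""
    now = datetime.now(timezone.utc)
    body = {
        "matchers": [{"name": "alertname", "value": alertname, "isRegex": False}],
        "startsAt": now.isoformat(),
        "endsAt": (now + timedelta(hours=hours)).isoformat(),
        "createdBy": created_by,
        "comment": comment,
    }
    resp = requests.post(f"{ALERTMANAGER_URL}/api/v2/silences", json=body, timeout=5)
    resp.raise_for_status()
    return resp.json()["silenceID"]


if __name__ == "__main__":
    sid = silence_alert(
        "HighLatencyP99", hours=2,
        comment="Planned load test", created_by="oncall@example.com",
    )
    print(f"Silence created: {sid}")
```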
## 🎯 Best Practices
### 1. Alert Design
✅ DO:
- Alert on symptoms, not causes
- Make alerts actionable
- Include runbook links
- Use appropriate severity levels
- Test alerts regularly
❌ DON'T:
- Alert on every metric
- Create alerts that require investigation to understand
- Alert on things you can't fix
- Duplicate alerts across systems
### 2. On-Call
✅ DO:
- Maintain clear rotation schedules
- Provide context in handoffs
- Limit on-call duration (max 1 week)
- Compensate on-call time
- Track on-call load
❌ DON'T:
- Have people on-call 24/7
- Make on-call mandatory without compensation
- Skip handoffs between rotations
### 3. Incident Response
✅ DO:
- Follow runbooks
- Communicate frequently
- Document decisions
- Implement action items
❌ DON'T:
- Skip incident documentation
- Point fingers
- Ignore post-mortem action items
## 🚨 Troubleshooting
### Too Many Alerts
- Review alert rules
- Increase thresholds where appropriate
- Implement better grouping
- Use alert inhibition
### Alerts Not Firing
- Check Prometheus query syntax
- Verify alert rule evaluation
- Check Alertmanager configuration
- Verify notification channel configs
### On-Call Burnout
- Review alert volume
- Reduce non-actionable alerts
- Improve runbooks
- Rotate more frequently
## 📚 Additional Resources
Version: 1.0.0
Last updated: December 2025