# 🚨 Skill: Alerting & Incident Management
## 📋 Metadata
| Attribute | Value |
|---|---|
| ID | sre-alerting-incident-management |
| Level | 🔴 Advanced |
| Version | 1.0.0 |
| Keywords | alerting, incident-management, pagerduty, opsgenie, oncall, runbooks |
| Reference | Google SRE - On-Call, PagerDuty |
## 🔑 Invocation Keywords
`alerting`, `incident-management`, `pagerduty`, `opsgenie`, `oncall`, `runbooks`, `incident-response`, `@skill:alerting`
### Example Prompts
- "Implement alerting with PagerDuty and runbooks"
- "Configure incident management and on-call rotation"
- "Configure runbooks and on-call rotation"
- "@skill:alerting - Complete alerting and incident management system"
## 📖 Description
Effective alerting and incident management are critical for keeping services reliable. This skill covers designing effective alerts, on-call rotations, runbooks, and incident response workflows.
## ✅ When to Use This Skill
- Production services
- On-call teams
- Critical SLAs
- Distributed systems
- Frequent incidents
- Compliance requirements
## ❌ When NOT to Use This Skill
- Local development only
- Non-critical services without an SLA
- Teams without on-call capacity
## 🏗️ Alerting Architecture
```
┌──────────────┐
│  Prometheus  │
│  (Metrics)   │
└──────┬───────┘
       │
       │ Alert Rules
       │
┌──────▼───────┐
│ Alertmanager │
└──────┬───────┘
       │
       ├─────────┬──────────┬─────────┐
       │         │          │         │
┌──────▼───┐ ┌───▼────┐ ┌───▼────┐ ┌──▼─────┐
│ PagerDuty│ │  Slack │ │  Email │ │ Webhook│
└──────┬───┘ └────────┘ └────────┘ └────────┘
       │
       │
┌──────▼──────────────┐
│  On-Call Engineer   │
│ (Incident Response) │
└─────────────────────┘
```
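Prometheus pushes firing alerts to Alertmanager through the `alerting:` block of `prometheus.yml`, and Alertmanager fans them out to the receivers shown above. The wiring can be smoke-tested from code; the following is a minimal sketch, assuming the `requests` library and a Prometheus instance at `http://prometheus:9090` (a placeholder hostname, matching the config later in this skill):

```python
# Hypothetical smoke test: verify Prometheus has discovered an Alertmanager
# instance by asking its /api/v1/alertmanagers endpoint.
import requests

PROMETHEUS_URL = "http://prometheus:9090"  # placeholder, adjust to your setup


def check_alertmanager_wiring() -> bool:
    """Return True if Prometheus reports at least one active Alertmanager."""
    resp = requests.get(f"{PROMETHEUS_URL}/api/v1/alertmanagers", timeout=5)
    resp.raise_for_status()
    active = resp.json()["data"]["activeAlertmanagers"]
    for am in active:
        print(f"Active Alertmanager: {am['url']}")
    return len(active) > 0


if __name__ == "__main__":
    if not check_alertmanager_wiring():
        raise SystemExit("Prometheus is not connected to any Alertmanager")
```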
## 💻 Implementation
### 1. Alertmanager Configuration
```yaml
# alertmanager/alertmanager.yml
global:
  resolve_timeout: 5m
  slack_api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'

route:
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 12h
  receiver: 'default'
  routes:
    # Critical alerts → PagerDuty immediately
    - match:
        severity: critical
      receiver: 'pagerduty-critical'
      continue: true

    # Warning alerts → Slack only
    - match:
        severity: warning
      receiver: 'slack-warnings'
      repeat_interval: 24h

    # Service-specific routes
    - match:
        service: payment-service
      receiver: 'payment-team'
      group_wait: 5s

inhibit_rules:
  # Inhibit warning if critical is firing
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'cluster', 'service']

receivers:
  - name: 'default'
    slack_configs:
      - channel: '#alerts'
        title: '{{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'

  - name: 'pagerduty-critical'
    pagerduty_configs:
      - service_key: 'YOUR_PAGERDUTY_SERVICE_KEY'
        description: '{{ .GroupLabels.alertname }}: {{ .GroupLabels.service }}'
        severity: 'critical'
        client: 'Prometheus'
        client_url: 'http://prometheus:9090'

  - name: 'slack-warnings'
    slack_configs:
      - channel: '#alerts-warnings'
        title: 'Warning: {{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'

  - name: 'payment-team'
    pagerduty_configs:
      - service_key: 'PAYMENT_TEAM_KEY'
        description: 'Payment Service Alert'
    slack_configs:
      - channel: '#payment-team'
```
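To verify that routing and inhibition behave as intended, it helps to look at what Alertmanager is actually holding. A minimal sketch, assuming Alertmanager's v2 HTTP API is reachable at `http://alertmanager:9093` (a placeholder hostname):

```python
# Hypothetical helper: list alerts currently known to Alertmanager, grouped
# by severity, via its v2 API. Requires the `requests` library.
from collections import defaultdict

import requests

ALERTMANAGER_URL = "http://alertmanager:9093"  # placeholder, adjust to your setup


def summarize_active_alerts() -> dict:
    """Group currently known alerts by their severity label."""
    resp = requests.get(f"{ALERTMANAGER_URL}/api/v2/alerts", timeout=5)
    resp.raise_for_status()
    by_severity = defaultdict(list)
    for alert in resp.json():
        labels = alert.get("labels", {})
        by_severity[labels.get("severity", "none")].append(
            labels.get("alertname", "unknown")
        )
    return dict(by_severity)


if __name__ == "__main__":
    for severity, names in summarize_active_alerts().items():
        print(f"{severity}: {len(names)} firing -> {sorted(set(names))}")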
### 2. Alert Rules (Prometheus)
```yaml
# prometheus/alerts/service-alerts.yml
groups:
  - name: service_alerts
    interval: 30s
    rules:
      # Error Budget Exhaustion
      - alert: ErrorBudgetExhaustion
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
            /
            sum(rate(http_requests_total[5m])) by (service)
          ) > 0.001
          and
          (
            sum_over_time(
              (
                sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
                /
                sum(rate(http_requests_total[5m])) by (service)
              )[28d:]
            ) > 0.001
          )
        for: 5m
        labels:
          severity: critical
          team: backend
        annotations:
          summary: "Error budget exhausted for {{ $labels.service }}"
          description: |
            Service {{ $labels.service }} has exceeded error budget threshold.
            Current error rate: {{ $value | humanizePercentage }}
            Runbook: https://runbooks.example.com/error-budget

      # High Latency
      - alert: HighLatencyP99
        expr: |
          histogram_quantile(0.99,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
          ) > 1.0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High P99 latency in {{ $labels.service }}"
          description: "P99 latency is {{ $value }}s (threshold: 1s)"
          runbook: https://runbooks.example.com/high-latency

      # Service Down
      - alert: ServiceDown
        expr: up{job=~"app-.*"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Service {{ $labels.job }} is down"
          description: "Service has been unreachable for more than 2 minutes"
          runbook: https://runbooks.example.com/service-down

      # High Memory Usage
      - alert: HighMemoryUsage
        expr: |
          (
            container_memory_usage_bytes{pod=~".+"}
            /
            container_spec_memory_limit_bytes{pod=~".+"}
          ) > 0.9
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage in {{ $labels.pod }}"
          description: "Memory usage is {{ $value | humanizePercentage }}"
          runbook: https://runbooks.example.com/high-memory

      # Disk Space
      - alert: DiskSpaceLow
        expr: |
          (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) < 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Low disk space on {{ $labels.instance }}"
          description: "Only {{ $value | humanizePercentage }} disk space remaining"

      # CPU Throttling
      - alert: CPUThrottling
        expr: |
          rate(container_cpu_cfs_throttled_seconds_total[5m]) > 0.1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "CPU throttling detected in {{ $labels.pod }}"
          description: "Container is being CPU throttled"

      # Connection Pool Exhaustion
      - alert: ConnectionPoolExhaustion
        expr: |
          (
            db_connections_active{service=~".+"}
            /
            db_connections_max{service=~".+"}
          ) > 0.9
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Connection pool near exhaustion in {{ $labels.service }}"
          description: "{{ $value | humanizePercentage }} of connections in use"
```
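Before committing a rule, it is worth running its expression as an ad-hoc query against Prometheus to confirm it returns the series you expect. A minimal sketch using the standard `/api/v1/query` endpoint (the URL is a placeholder; the expression is the `HighLatencyP99` rule from above):

```python
# Hypothetical rule preview: evaluate an alert expression as an instant query
# and print any series that would currently fire. Requires `requests`.
import requests

PROMETHEUS_URL = "http://prometheus:9090"  # placeholder

HIGH_LATENCY_EXPR = (
    "histogram_quantile(0.99, "
    "sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)) > 1.0"
)


def preview_alert(expr: str) -> list:
    """Return the instant-query result vector for an alert expression."""
    resp = requests.get(
        f"{PROMETHEUS_URL}/api/v1/query", params={"query": expr}, timeout=10
    )
    resp.raise_for_status()
    return resp.json()["data"]["result"]


if __name__ == "__main__":
    for series in preview_alert(HIGH_LATENCY_EXPR):
        service = series["metric"].get("service", "unknown")
        value = series["value"][1]  # result is [timestamp, value]
        print(f"Would fire for {service}: p99 = {value}s")
```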
### 3. Runbook Template
# Runbook: High Error Rate
## Alert Name
`HighErrorRate`
## Severity
Critical
## Description
Service error rate exceeds threshold (5% for 5 minutes)
## Symptoms
- High HTTP 5xx response rate
- User complaints
- Error logs increasing
## Immediate Actions
1. **Acknowledge Alert**
   - Acknowledge in PagerDuty/OpsGenie
   - Notify team in Slack

2. **Check Service Health**
   ```bash
   kubectl get pods -l app=service-name
   kubectl logs -l app=service-name --tail=100
   ```

3. **Check Metrics**
   - Grafana dashboard: `/d/service-overview`
   - Error rate graph
   - Latency graphs

4. **Identify Root Cause**
   - Check recent deployments
   - Review error logs
   - Check dependencies (DB, APIs)
## Resolution Steps

**If caused by code deployment:**
```bash
# Rollback deployment
kubectl rollout undo deployment/service-name
```

**If caused by resource exhaustion:**
```bash
# Scale up
kubectl scale deployment/service-name --replicas=5
```

**If caused by dependency failure:**
- Check dependency service status
- Implement circuit breaker (see the sketch after this runbook)
- Enable fallback mechanisms
## Post-Incident
- Document in incident log
- Update runbook if needed
## Escalation
- If not resolved in 15 min → Escalate to senior engineer
- If not resolved in 30 min → Escalate to engineering manager
- If service completely down → Escalate to CTO
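The "implement circuit breaker" step in the resolution list is easier if the pattern is prepared before an incident. The sketch below shows the general idea only; the failure threshold and recovery timeout are illustrative assumptions, not values taken from this skill:

```python
# Minimal circuit breaker sketch: stop calling a failing dependency for a
# cool-down period instead of letting errors cascade. Illustrative only.
import time


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, recovery_timeout: float = 30.0):
        self.failure_threshold = failure_threshold  # failures before opening
        self.recovery_timeout = recovery_timeout    # seconds to stay open
        self.failures = 0
        self.opened_at = None

    def call(self, func, *args, **kwargs):
        # Open state: fail fast until the recovery timeout has elapsed.
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.recovery_timeout:
                raise RuntimeError("circuit open: dependency call skipped")
            self.opened_at = None  # half-open: allow one trial call
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0  # a success closes the circuit again
        return result
```

Callers wrap dependency calls, e.g. `breaker.call(fetch_payment_status, order_id)` (a hypothetical function), and the fallback mechanisms from the runbook handle the fast failures while the circuit is open.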
### 4. On-Call Rotation (PagerDuty)
```yaml
# pagerduty/escalation-policies.yml
escalation_policies:
  - name: "Primary On-Call"
    description: "Primary escalation for production services"
    num_loops: 3
    escalation_rules:
      - escalation_delay_in_minutes: 0
        targets:
          - type: "user"
            id: "PXXXXXXXX"  # Primary on-call
      - escalation_delay_in_minutes: 15
        targets:
          - type: "user"
            id: "PYYYYYYYY"  # Secondary on-call
      - escalation_delay_in_minutes: 30
        targets:
          - type: "schedule"
            id: "PZZZZZZZZ"  # Manager on-call

  - name: "Critical Alerts Only"
    escalation_rules:
      - escalation_delay_in_minutes: 0
        targets:
          - type: "user_reference"
            id: "PXXXXXXXX"
      - escalation_delay_in_minutes: 10
        targets:
          - type: "escalation_policy_reference"
            id: "EPXXXXXXX"  # Manager escalation

schedules:
  - name: "Primary On-Call Schedule"
    time_zone: "America/New_York"
    layers:
      - name: "Layer 1"
        start: "2024-01-01T00:00:00"
        rotation_virtual_start: "2024-01-01T00:00:00"
        rotation_turn_length_seconds: 604800  # 1 week
        users:
          - user_id: "PXXXXXXXX"
          - user_id: "PYYYYYYYY"
          - user_id: "PZZZZZZZZ"
        restrictions:
          - type: "daily_restriction"
            start_time_of_day: "09:00:00"
            duration_seconds: 32400  # 9 hours
```
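Alertmanager normally opens PagerDuty incidents through the `pagerduty_configs` receiver shown earlier, but incidents can also be triggered programmatically, for example from a deploy pipeline or a chat command. A minimal sketch against the PagerDuty Events API v2 (the routing key is a placeholder for your integration key):

```python
# Hypothetical manual trigger: open a PagerDuty incident via the Events API v2.
# Requires `requests` and an Events v2 integration routing key.
import requests

PAGERDUTY_EVENTS_URL = "https://events.pagerduty.com/v2/enqueue"
ROUTING_KEY = "YOUR_PAGERDUTY_ROUTING_KEY"  # placeholder


def trigger_incident(summary: str, source: str, severity: str = "critical") -> str:
    """Trigger an incident and return the dedup_key for later ack/resolve."""
    payload = {
        "routing_key": ROUTING_KEY,
        "event_action": "trigger",
        "payload": {
            "summary": summary,    # human-readable description
            "source": source,      # affected system, e.g. hostname or service
            "severity": severity,  # critical, error, warning, or info
        },
    }
    resp = requests.post(PAGERDUTY_EVENTS_URL, json=payload, timeout=10)
    resp.raise_for_status()
    return resp.json()["dedup_key"]


if __name__ == "__main__":
    key = trigger_incident("payment-service error rate above SLO", "payment-service")
    print(f"Incident triggered, dedup_key={key}")
```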
### 5. Incident Response Workflow
```python
# incident_response/workflow.py
from dataclasses import dataclass
from datetime import datetime
from enum import Enum
from typing import List, Optional


class IncidentSeverity(Enum):
    SEV1 = "critical"  # Service down
    SEV2 = "high"      # Major degradation
    SEV3 = "medium"    # Minor issues
    SEV4 = "low"       # Informational


@dataclass
class Incident:
    id: str
    title: str
    severity: IncidentSeverity
    status: str  # open, investigating, mitigated, resolved
    created_at: datetime
    resolved_at: Optional[datetime]
    assigned_to: str
    affected_services: List[str]
    description: str
    root_cause: Optional[str] = None
    resolution: Optional[str] = None


class IncidentResponseWorkflow:
    def __init__(self):
        self.incidents = []

    def create_incident(
        self,
        title: str,
        severity: IncidentSeverity,
        affected_services: List[str],
        description: str
    ) -> Incident:
        incident = Incident(
            id=f"INC-{datetime.now().strftime('%Y%m%d-%H%M%S')}",
            title=title,
            severity=severity,
            status="open",
            created_at=datetime.now(),
            resolved_at=None,
            assigned_to=self._assign_oncall(),
            affected_services=affected_services,
            description=description
        )
        self.incidents.append(incident)
        self._notify_team(incident)
        self._create_incident_channel(incident)
        return incident

    def _assign_oncall(self) -> str:
        # Logic to assign to current on-call engineer
        return "engineer@example.com"

    def _notify_team(self, incident: Incident):
        # Send notifications via PagerDuty, Slack, etc.
        pass

    def _create_incident_channel(self, incident: Incident):
        # Create dedicated Slack channel for incident
        channel_name = f"incident-{incident.id.lower()}"
        # Create channel logic
        pass

    def update_status(self, incident_id: str, status: str, notes: str):
        incident = self._find_incident(incident_id)
        if incident:
            incident.status = status
            if status == "resolved":
                incident.resolved_at = datetime.now()

    def _find_incident(self, incident_id: str) -> Optional[Incident]:
        return next((i for i in self.incidents if i.id == incident_id), None)
```
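A short usage sketch of the workflow class above; the service names, titles, and notes are illustrative values, not data from this skill:

```python
# Example usage of IncidentResponseWorkflow (illustrative values only).
workflow = IncidentResponseWorkflow()

incident = workflow.create_incident(
    title="Payment API returning 5xx above SLO",
    severity=IncidentSeverity.SEV1,
    affected_services=["payment-service", "checkout"],
    description="Error rate 4.2% over the last 10 minutes, SLO is 0.1%",
)
print(incident.id, incident.assigned_to)

# Later, once mitigation has been rolled out:
workflow.update_status(incident.id, "resolved", notes="Rolled back the last deployment")
```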
### 6. Alert Fatigue Prevention
```yaml
# alertmanager/alert-fatigue-prevention.yml
# Strategies to prevent alert fatigue

# 1. Alert Grouping
route:
  group_by: ['alertname', 'service']
  group_wait: 10s       # Wait before sending initial notification
  group_interval: 5m    # Wait before sending updated notification
  repeat_interval: 12h  # Minimum time between notifications

# 2. Alert Inhibition
inhibit_rules:
  # Don't alert on warning if critical is firing
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['service']

  # Don't alert on individual instance if all instances are down
  - source_match:
      alertname: 'AllInstancesDown'
    target_match_re:
      alertname: '.*InstanceDown'

# 3. Alert Suppression (Silence Rules)
# Use Alertmanager UI or API to create silences for known issues

# 4. Threshold Tuning
# Use error budgets and SLOs to set meaningful thresholds
# Example: Alert only when error budget is at risk

# 5. Alert Classification
# Only page on actionable alerts
# Use different channels for different severities
```
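Point 3 above mentions silences; they can be created from the Alertmanager UI, with `amtool`, or through the v2 API. A minimal sketch of the API route, assuming Alertmanager at `http://alertmanager:9093` (placeholder) and a known-noisy alert name:

```python
# Hypothetical silence creation: mute one alert for a maintenance window
# via Alertmanager's v2 API. Requires the `requests` library.
from datetime import datetime, timedelta, timezone

import requests

ALERTMANAGER_URL = "http://alertmanager:9093"  # placeholder


def silence_alert(alertname: str, hours: int, comment: str, created_by: str) -> str:
    """Create a silence matching one alertname and return its silence ID."""
    now = datetime.now(timezone.utc)
    body = {
        "matchers": [{"name": "alertname", "value": alertname, "isRegex": False}],
        "startsAt": now.isoformat(),
        "endsAt": (now + timedelta(hours=hours)).isoformat(),
        "createdBy": created_by,
        "comment": comment,
    }
    resp = requests.post(f"{ALERTMANAGER_URL}/api/v2/silences", json=body, timeout=5)
    resp.raise_for_status()
    return resp.json()["silenceID"]


if __name__ == "__main__":
    sid = silence_alert(
        "HighLatencyP99", hours=2,
        comment="Planned load test", created_by="oncall@example.com",
    )
    print(f"Silence created: {sid}")
```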
## 🎯 Best Practices
### 1. Alert Design
✅ DO:
- Alert on symptoms, not causes
- Make alerts actionable
- Include runbook links
- Use appropriate severity levels
- Test alerts regularly
❌ DON'T:
- Alert on every metric
- Create alerts that require investigation to understand
- Alert on things you can't fix
- Duplicate alerts across systems
### 2. On-Call
✅ DO:
- Maintain clear rotation schedules
- Provide context in handoffs
- Limit on-call duration (max 1 week)
- Compensate on-call time
- Track on-call load
❌ DON'T:
- Have people on-call 24/7
- Make on-call mandatory without compensation
- Skip handoffs between rotations
### 3. Incident Response
✅ DO:
- Follow runbooks
- Communicate frequently
- Document decisions
- Implement action items
❌ DON'T:
- Skip incident documentation
- Point fingers
- Ignore post-mortem action items
## 🚨 Troubleshooting
### Too Many Alerts
- Review alert rules
- Increase thresholds where appropriate
- Implement better grouping
- Use alert inhibition
### Alerts Not Firing
- Check Prometheus query syntax
- Verify alert rule evaluation
- Check Alertmanager configuration
- Verify notification channel configs
### On-Call Burnout
- Review alert volume
- Reduce non-actionable alerts
- Improve runbooks
- Rotate more frequently
## 📚 Additional Resources
Version: 1.0.0
Last updated: December 2025