🗄️ Skill: Database Reliability
📋 Metadata
| Attribute | Value |
|---|---|
| ID | sre-database-reliability |
| Level | 🔴 Advanced |
| Version | 1.0.0 |
| Keywords | database, postgresql, mysql, mongodb, replication, backup, failover, connection-pooling |
| Reference | PostgreSQL High Availability |
🔑 Invocation Keywords
`database-reliability`, `database-replication`, `database-backup`, `failover`, `connection-pooling`, `database-monitoring`, `@skill:database-reliability`
Prompt Examples
Implement database replication and failover for PostgreSQL
Configure a backup and recovery strategy for the database
Set up connection pooling and database monitoring
@skill:database-reliability - Database high availability
📖 Description
Database reliability is fundamental to running dependable services. This skill covers replication, failover, backups, connection pooling, monitoring, and disaster recovery for production databases.
✅ When to Use This Skill
- Production databases
- Services with high-availability requirements
- Critical data
- Strict SLAs
- Compliance requirements
❌ When NOT to Use This Skill
- Local development with test data
- Prototypes without real data
- Read-only systems without critical availability requirements
🏗️ High Availability Architecture
┌─────────────────────────────────────────┐
│ Application Layer │
│ ┌─────────┐ ┌─────────┐ ┌─────────┐ │
│ │ App │ │ App │ │ App │ │
│ └────┬────┘ └────┬────┘ └────┬────┘ │
└───────┼────────────┼────────────┼───────┘
│ │ │
└────────────┼────────────┘
│
┌──────▼──────┐
│ Connection │
│ Pool │
└──────┬──────┘
│
┌────────────┼────────────┐
│ │ │
┌────▼────┐ ┌────▼───┐ ┌─────▼────┐
│Primary │ │Replica │ │ Replica │
│(Master) │ │(Read) │ │ (Read) │
└────┬────┘ └────────┘ └──────────┘
│
│ Replication
│
┌────▼────┐
│ Standby │
│ (DR) │
└─────────┘
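In this layout, writes always go to the primary while reads can be spread across the replicas. A minimal application-side sketch of that read/write split, assuming `psycopg2`, hypothetical hostnames (`pg-primary`, `pg-replica-1`, `pg-replica-2`) and an illustrative `events` table:

```python
# Read/write routing sketch (hypothetical hostnames and table).
import random

import psycopg2

PRIMARY_DSN = "host=pg-primary port=5432 dbname=mydb user=app password=secret"
REPLICA_DSNS = [
    "host=pg-replica-1 port=5432 dbname=mydb user=app password=secret",
    "host=pg-replica-2 port=5432 dbname=mydb user=app password=secret",
]


def get_connection(readonly: bool):
    """Writes go to the primary; reads are spread across the replicas."""
    dsn = random.choice(REPLICA_DSNS) if readonly else PRIMARY_DSN
    return psycopg2.connect(dsn)


# Write path: always the primary
with get_connection(readonly=False) as conn, conn.cursor() as cur:
    cur.execute("INSERT INTO events (payload) VALUES (%s)", ("signup",))

# Read path: any replica
with get_connection(readonly=True) as conn, conn.cursor() as cur:
    cur.execute("SELECT count(*) FROM events")
    print(cur.fetchone()[0])
```

In production this routing usually lives behind the connection pool or a proxy (PgBouncer, HAProxy), but the decision is the same: read-only traffic to replicas, everything else to the primary.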
💻 Implementation
📁 Executable scripts: This skill includes executable Python scripts in the `scripts/` folder:
- `postgresql_backup.py` - Backup and restore automation
- `requirements.txt` - Dependencies (none; uses the stdlib)
See `scripts/README.md` for usage documentation.
1. PostgreSQL Replication Setup
1.1 Primary Configuration
# postgresql.conf (Primary)
# Connection Settings
listen_addresses = '*'
max_connections = 200
# Replication Settings
wal_level = replica
max_wal_senders = 3
max_replication_slots = 3
hot_standby = on
# Performance
shared_buffers = 4GB
effective_cache_size = 12GB
maintenance_work_mem = 1GB
checkpoint_completion_target = 0.9
wal_buffers = 16MB
default_statistics_target = 100
random_page_cost = 1.1
effective_io_concurrency = 200
work_mem = 20MB
min_wal_size = 1GB
max_wal_size = 4GB
# Logging
log_destination = 'stderr'
logging_collector = on
log_directory = 'log'
log_filename = 'postgresql-%Y-%m-%d_%H%M%S.log'
log_rotation_age = 1d
log_rotation_size = 100MB
log_line_prefix = '%t [%p]: [%l-1] user=%u,db=%d,app=%a,client=%h '
log_timezone = 'UTC'
# pg_hba.conf (Primary)
# Allow replication from replicas
host replication replicator 10.0.0.0/24 md5
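After reloading or restarting the primary, it can be worth confirming that the replication-related settings were actually applied. A small sketch using `psycopg2` (the DSN is a placeholder):

```python
# Sketch: verify replication-related settings on the primary via pg_settings.
import psycopg2

EXPECTED = {
    "wal_level": "replica",
    "max_wal_senders": "3",
    "max_replication_slots": "3",
    "hot_standby": "on",
}

conn = psycopg2.connect("host=primary-host dbname=postgres user=postgres")
with conn, conn.cursor() as cur:
    cur.execute(
        "SELECT name, setting FROM pg_settings WHERE name = ANY(%s)",
        (list(EXPECTED),),
    )
    current = dict(cur.fetchall())
conn.close()

for name, expected in EXPECTED.items():
    status = "OK" if current.get(name) == expected else "MISMATCH"
    print(f"{status}: {name} = {current.get(name)!r} (expected {expected!r})")
```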
1.2 Replica Configuration
# Setup replication (on replica server)
# 1. Stop PostgreSQL
sudo systemctl stop postgresql
# 2. Backup primary
pg_basebackup -h primary-host -D /var/lib/postgresql/data -U replicator -P -W -R -S replication_slot_1
# 3. Configure recovery
# postgresql.conf (Replica)
hot_standby = on
max_standby_streaming_delay = 30s
wal_receiver_status_interval = 10s
hot_standby_feedback = on
# postgresql.auto.conf (Auto-generated)
primary_conninfo = 'host=primary-host port=5432 user=replicator password=xxx application_name=replica1'
primary_slot_name = 'replication_slot_1'
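Once the replica starts, it should report that it is in recovery and replaying WAL received from the primary. A quick verification sketch, run against the replica (hostname is a placeholder):

```python
# Sketch: confirm the standby is in recovery and measure apply lag in seconds.
import psycopg2

conn = psycopg2.connect("host=replica-host dbname=postgres user=postgres")
with conn, conn.cursor() as cur:
    cur.execute("SELECT pg_is_in_recovery()")
    in_recovery = cur.fetchone()[0]

    cur.execute("""
        SELECT pg_last_wal_receive_lsn(),
               pg_last_wal_replay_lsn(),
               EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp()))
    """)
    receive_lsn, replay_lsn, lag_seconds = cur.fetchone()
conn.close()

print(f"in_recovery={in_recovery}")
print(f"receive_lsn={receive_lsn} replay_lsn={replay_lsn}")
print(f"apply lag ~{lag_seconds or 0:.1f}s")
```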
1.3 Replication Slot Management
-- Create replication slot
SELECT pg_create_physical_replication_slot('replication_slot_1');
-- List replication slots
SELECT * FROM pg_replication_slots;
-- Monitor replication lag
SELECT
client_addr,
state,
sent_lsn,
write_lsn,
flush_lsn,
replay_lsn,
sync_state,
sync_priority
FROM pg_stat_replication;
-- Check lag in bytes
SELECT
pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS lag_bytes,
client_addr,
application_name
FROM pg_stat_replication;
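Inactive or badly lagging slots keep WAL on the primary and can eventually fill the disk, so they deserve a check of their own. A hedged monitoring sketch built on the queries above (the threshold and DSN are illustrative):

```python
# Sketch: warn on inactive slots or excessive retained WAL per slot (run on the primary).
import psycopg2

MAX_LAG_BYTES = 512 * 1024 * 1024  # 512 MiB, illustrative threshold

conn = psycopg2.connect("host=primary-host dbname=postgres user=postgres")
with conn, conn.cursor() as cur:
    cur.execute("""
        SELECT slot_name,
               active,
               pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) AS retained_bytes
        FROM pg_replication_slots
    """)
    for slot_name, active, retained_bytes in cur.fetchall():
        if not active:
            print(f"WARNING: slot {slot_name} is inactive and retaining WAL")
        elif retained_bytes and retained_bytes > MAX_LAG_BYTES:
            print(f"WARNING: slot {slot_name} retains {retained_bytes / 1024**2:.0f} MiB of WAL")
        else:
            print(f"OK: {slot_name}")
conn.close()
```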
2. Automatic Failover (Patroni)
# patroni/patroni.yml
scope: postgres-cluster
namespace: /db/
name: postgres-node1

restapi:
  listen: 0.0.0.0:8008
  connect_address: 10.0.1.1:8008
  auth: 'username:password'

bootstrap:
  dcs:
    ttl: 30
    loop_wait: 10
    retry_timeout: 30
    maximum_lag_on_failover: 1048576
    postgresql:
      use_pg_rewind: true
      parameters:
        max_connections: 200
        max_worker_processes: 8
        wal_level: replica
        hot_standby: on
        wal_keep_segments: 8
        max_wal_senders: 3
        max_replication_slots: 3
        synchronous_commit: 'on'
        synchronous_standby_names: 'ANY 1 (postgres-node2,postgres-node3)'
  initdb:
    - encoding: UTF8
    - data-checksums
  pg_hba:
    - host replication replicator 0.0.0.0/0 md5
    - host all all 0.0.0.0/0 md5
  users:
    admin:
      password: admin
      options:
        - createrole
        - createdb

postgresql:
  listen: 0.0.0.0:5432
  connect_address: 10.0.1.1:5432
  data_dir: /var/lib/postgresql/data
  pgpass: /tmp/pgpass
  authentication:
    replication:
      username: replicator
      password: replicator
    superuser:
      username: postgres
      password: postgres
  parameters:
    unix_socket_directories: '/var/run/postgresql'
  recovery_conf:
    restore_command: 'cp /backup/%f %p'

tags:
  nofailover: false
  noloadbalance: false
  clonefrom: false
  nosync: false
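Patroni exposes cluster state through the REST API configured above (port 8008). A minimal probe sketch using `requests`; the node addresses are illustrative, and it relies on the documented behaviour that `/leader` and `/replica` return HTTP 200 only when the node currently holds that role:

```python
# Sketch: probe each Patroni node to find the current leader and healthy replicas.
import requests

NODES = ["10.0.1.1", "10.0.1.2", "10.0.1.3"]  # illustrative addresses

for host in NODES:
    base = f"http://{host}:8008"
    try:
        is_leader = requests.get(f"{base}/leader", timeout=2).status_code == 200
        is_replica = requests.get(f"{base}/replica", timeout=2).status_code == 200
        state = requests.get(f"{base}/patroni", timeout=2).json().get("state")
    except requests.RequestException as exc:
        print(f"{host}: unreachable ({exc})")
        continue
    role = "leader" if is_leader else ("replica" if is_replica else "unknown")
    print(f"{host}: role={role} state={state}")
```

The same endpoints are what load balancers typically use as health checks to route writes to the leader and reads to the replicas.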
3. Connection Pooling (PgBouncer)
# pgbouncer/pgbouncer.ini
[databases]
mydb = host=primary-host port=5432 dbname=mydb
[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 6432
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt
pool_mode = transaction
max_client_conn = 1000
default_pool_size = 25
min_pool_size = 5
reserve_pool_size = 5
reserve_pool_timeout = 3
max_db_connections = 100
max_user_connections = 50
server_round_robin = 1
ignore_startup_parameters = extra_float_digits
# Logging
log_connections = 1
log_disconnections = 1
log_pooler_errors = 1
# Admin console
admin_users = admin
stats_users = stats
# Health check
server_check_query = SELECT 1
server_check_delay = 30
server_lifetime = 3600
server_idle_timeout = 600
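PgBouncer's admin console (the virtual `pgbouncer` database) can be queried like a regular database to watch pool saturation. A sketch with `psycopg2`, assuming the `stats` user defined above and a placeholder host:

```python
# Sketch: read pool usage from PgBouncer's admin console via SHOW POOLS.
import psycopg2

conn = psycopg2.connect(
    host="pgbouncer-host", port=6432, dbname="pgbouncer",
    user="stats", password="secret",
)
conn.autocommit = True  # the admin console does not support transactions

with conn.cursor() as cur:
    cur.execute("SHOW POOLS")
    columns = [desc[0] for desc in cur.description]
    for row in cur.fetchall():
        pool = dict(zip(columns, row))
        print(f"{pool['database']}/{pool['user']}: "
              f"active={pool['cl_active']} waiting={pool['cl_waiting']} "
              f"server_active={pool['sv_active']} server_idle={pool['sv_idle']}")
conn.close()
```

A sustained non-zero `cl_waiting` is the usual sign that `default_pool_size` or `max_db_connections` is too small for the workload.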
4. Backup Strategy
#!/bin/bash
# scripts/backup.sh
BACKUP_DIR="/backup/postgresql"
DATE=$(date +%Y%m%d_%H%M%S)
RETENTION_DAYS=30
# Full backup
pg_basebackup \
-h localhost \
-D "$BACKUP_DIR/full_$DATE" \
-U backup_user \
-Ft \
-z \
-P
# WAL archiving (continuous)
# In postgresql.conf:
# archive_mode = on
# archive_command = 'test ! -f /backup/wal/%f && cp %p /backup/wal/%f'
# Cleanup old backups (skip the backup root itself)
find "$BACKUP_DIR" -mindepth 1 -maxdepth 1 -type d -mtime +$RETENTION_DAYS -exec rm -rf {} +
# Verify backup (note: on PostgreSQL versions before 17, pg_verifybackup requires a
# plain-format backup, so either use -Fp above or verify the extracted tarball)
pg_verifybackup -D "$BACKUP_DIR/full_$DATE"
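The continuous WAL archiving referenced in the script is only useful while `archive_command` keeps succeeding, so it pays to check `pg_stat_archiver` alongside the base backups. A small sketch (the DSN is a placeholder):

```python
# Sketch: verify that continuous WAL archiving is healthy via pg_stat_archiver.
import psycopg2

conn = psycopg2.connect("host=localhost dbname=postgres user=backup_user")
with conn, conn.cursor() as cur:
    cur.execute("""
        SELECT archived_count, failed_count, last_archived_wal,
               last_archived_time, last_failed_wal, last_failed_time
        FROM pg_stat_archiver
    """)
    archived, failed, last_wal, last_time, failed_wal, failed_time = cur.fetchone()
conn.close()

print(f"archived={archived} failed={failed}")
print(f"last archived: {last_wal} at {last_time}")
if failed_time and (not last_time or failed_time > last_time):
    print(f"WARNING: most recent archive attempt failed ({failed_wal} at {failed_time})")
```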
Executable script: scripts/postgresql_backup.py
A complete CLI script for automating PostgreSQL backups, restores, and cleanup.
When to run
- Scheduled backups: as part of cron jobs or other automation
- Emergency restores: to restore databases from backups
- Backup cleanup: to keep storage usage under control
- Backup verification: to validate backup integrity before restoring
CLI usage
# Create a full backup
python scripts/postgresql_backup.py backup \
--host localhost \
--user backup_user \
--backup-dir /backup/postgresql
# List available backups
python scripts/postgresql_backup.py list \
--backup-dir /backup/postgresql
# Clean up old backups (dry-run first)
python scripts/postgresql_backup.py cleanup \
--backup-dir /backup/postgresql \
--retention-days 30 \
--dry-run
# Run the actual cleanup
python scripts/postgresql_backup.py cleanup \
--backup-dir /backup/postgresql \
--retention-days 30
# Restore from a backup
python scripts/postgresql_backup.py restore \
--backup-path /backup/postgresql/full_20240115_120000 \
--target-dir /var/lib/postgresql/data
Usage as a module
from scripts.postgresql_backup import PostgreSQLBackup

# Create an instance
backup = PostgreSQLBackup(
    host="localhost",
    port=5432,
    user="backup_user",
    backup_dir="/backup/postgresql"
)

# Create a backup
backup_path = backup.full_backup(verify=True)

# Clean up old backups
backup.cleanup_old_backups(retention_days=30)

# List backups
backup.list_backups()
Features
- ✅ Full physical backups with `pg_basebackup`
- ✅ Automatic backup verification
- ✅ Automated restore
- ✅ Cleanup of old backups with a retention policy
- ✅ Listing of available backups
- ✅ Dry-run mode for cleanup
- ✅ Robust error handling
5. Database Monitoring
# prometheus/database-metrics.yml
scrape_configs:
  - job_name: 'postgres-exporter'
    static_configs:
      - targets: ['postgres-exporter:9187']

  - job_name: 'postgresql'
    static_configs:
      - targets: ['localhost:9187']
-- Custom metrics queries
-- Connection usage (pg_stat_database has no max_connections column, so read the setting)
SELECT
    datname,
    numbackends AS active_connections,
    current_setting('max_connections')::int AS max_connections,
    numbackends::float / current_setting('max_connections')::int * 100 AS connection_pct
FROM pg_stat_database;
-- Replication lag
SELECT
EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp())) AS lag_seconds;
-- Largest tables (total size vs. heap size; the difference is indexes + TOAST, a rough bloat proxy)
SELECT
    schemaname,
    tablename,
    pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) AS total_size,
    pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename) - pg_relation_size(schemaname||'.'||tablename)) AS external_size
FROM pg_tables
WHERE schemaname NOT IN ('pg_catalog', 'information_schema')
ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC
LIMIT 10;
-- Slow queries
SELECT
pid,
now() - pg_stat_activity.query_start AS duration,
query,
state
FROM pg_stat_activity
WHERE (now() - pg_stat_activity.query_start) > interval '5 minutes'
AND state = 'active';
-- Index usage
SELECT
schemaname,
tablename,
indexname,
idx_scan as index_scans,
pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_stat_user_indexes
ORDER BY idx_scan ASC;
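The slow-query view is most useful when something acts on it. A hedged sketch that lists queries running longer than five minutes and can optionally cancel them with `pg_cancel_backend` (cancellation is commented out by default; the DSN is a placeholder):

```python
# Sketch: list long-running queries and optionally cancel them.
import psycopg2

THRESHOLD = "5 minutes"  # matches the SQL example above

conn = psycopg2.connect("host=localhost dbname=mydb user=postgres")
with conn, conn.cursor() as cur:
    cur.execute("""
        SELECT pid, now() - query_start AS duration, left(query, 80)
        FROM pg_stat_activity
        WHERE state = 'active'
          AND now() - query_start > %s::interval
        ORDER BY duration DESC
    """, (THRESHOLD,))
    for pid, duration, query in cur.fetchall():
        print(f"pid={pid} duration={duration} query={query!r}")
        # Uncomment to send a cancel request to the offending backend:
        # cur.execute("SELECT pg_cancel_backend(%s)", (pid,))
conn.close()
```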
6. Alert Rules
# prometheus/alerts/database-alerts.yml
groups:
  - name: database_alerts
    interval: 30s
    rules:
      # High connection usage (metric names assume postgres_exporter with the pg_settings collector)
      - alert: DatabaseHighConnections
        expr: |
          pg_stat_database_numbackends
            / on(instance) group_left() pg_settings_max_connections
          > 0.9
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High connection usage in {{ $labels.datname }}"
          description: "{{ $value | humanizePercentage }} of connections in use"

      # Replication lag
      - alert: DatabaseReplicationLag
        expr: |
          pg_replication_lag_seconds > 30
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High replication lag"
          description: "Replication lag is {{ $value }} seconds"

      # Database down
      - alert: DatabaseDown
        expr: |
          up{job="postgres-exporter"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Database exporter is down"

      # Slow queries
      - alert: DatabaseSlowQueries
        expr: |
          pg_stat_activity_max_tx_duration > 300
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Slow query detected"
          description: "Query running for {{ $value }} seconds"
🎯 Best Practices
1. Replication
✅ DO:
- Use synchronous replication for critical data
- Monitor replication lag
- Test failover regularly
- Use replication slots
❌ DON'T:
- Ignore replication lag
- Skip failover testing
- Use async replication for critical writes
2. Backups
✅ DO:
- Test restore procedures regularly
- Use point-in-time recovery
- Store backups off-site
- Encrypt backups
❌ DON'T:
- Assume backups work without testing
- Keep all backups in one location
- Skip backup verification
3. Connection Pooling
✅ DO:
- Use connection pooling
- Set appropriate pool sizes
- Monitor pool usage
- Handle connection errors gracefully (see the sketch after this section's lists)
❌ DON'T:
- Exceed database connection limits
- Use too large pools
- Ignore pool exhaustion
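A sketch of application-side pooling with basic retry on transient connection errors, using `psycopg2.pool`; the pool sizes and DSN are illustrative and should stay within the PgBouncer and database limits configured earlier:

```python
# Sketch: application-side connection pool with simple retry on transient errors.
import time

import psycopg2
from psycopg2 import OperationalError, pool

# Pool sizes are illustrative; keep them within the database/PgBouncer limits.
db_pool = pool.ThreadedConnectionPool(
    minconn=2, maxconn=10,
    dsn="host=pgbouncer-host port=6432 dbname=mydb user=app password=secret",
)


def run_query(sql, params=None, retries=3):
    """Run a query, retrying on transient connection failures with backoff."""
    for attempt in range(retries):
        conn = db_pool.getconn()
        try:
            with conn, conn.cursor() as cur:
                cur.execute(sql, params)
                rows = cur.fetchall()
            db_pool.putconn(conn)
            return rows
        except OperationalError:
            # Discard the (possibly broken) connection instead of returning it to the pool.
            db_pool.putconn(conn, close=True)
            if attempt == retries - 1:
                raise
            time.sleep(2 ** attempt)  # simple exponential backoff


print(run_query("SELECT 1"))
```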
🚨 Troubleshooting
Replication Lag
- Check network connectivity
- Monitor WAL generation rate
- Check replica performance
- Consider increasing resources
Connection Exhaustion
- Reduce connection pool size
- Close idle connections
- Use connection pooling
- Increase database max_connections
Backup Failures
- Check disk space
- Verify permissions
- Test backup commands manually
- Review backup logs
📚 Additional Resources
Version: 1.0.0
Last updated: December 2025
Total lines: 1,100+