Disaster recovery: operational runbook
Procedures for disaster recovery: database loss, region outage, command log corruption, and point-in-time restore. Includes RPO/RTO targets per phase and executable runbooks.
Recovery targets
| Fase | RPO (Recovery Point Objective) | RTO (Recovery Time Objective) | Disponibilidad |
|---|---|---|---|
| M1 (single-node) | 24h (daily backup) | 4h (manual restore) | 99% |
| M2 (Patroni HA) | 5 min (WAL streaming) | 30s (auto-failover) | 99.9% |
| M3 (multi-AZ) | 30s (sync replication) | 60s (auto-failover) | 99.95% |
| M4 (multi-region) | 5 min (async cross-region) | 10 min (manual region failover) | 99.99% |
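Definitions:
- RPO: how much data you can afford to lose. An RPO of 5 min means writes from the last 5 minutes before the failure may be lost.
- RTO: how long the service takes to come back. An RTO of 30s means the service is serving again within 30 seconds.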
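The availability column can be read as a yearly downtime budget, which is useful for judging whether an RTO target is compatible with it. The sketch below does that arithmetic; the helper name is hypothetical and a 365-day year is assumed.

```shell
#!/bin/sh
# Yearly downtime budget implied by an availability target.
# Assumes a 365-day year (525,600 minutes); the helper name is hypothetical.
downtime_minutes_per_year() (
    availability_pct=$1
    LC_ALL=C awk -v a="$availability_pct" \
        'BEGIN { printf "%.1f\n", (100 - a) / 100 * 525600 }'
)

# Print the budget for each phase in the targets table.
for pct in 99 99.9 99.95 99.99; do
    echo "$pct% -> $(downtime_minutes_per_year "$pct") min/year"
done
```

Under these assumptions the 99.99% target of M4 leaves about 52.6 minutes per year, so a single 10-minute manual region failover uses roughly a fifth of the budget.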
Failure inventory
| Category | Expected frequency | Impact |
|---|---|---|
| Single node crash (engine) | monthly | 30s (HA) |
| Single node crash (Postgres replica) | monthly | 0 (replica only) |
| Postgres primary crash | every 6 months | 30s failover |
| Disk full | quarterly | minutes (manual) |
| Data corruption (logical) | yearly | hours |
| AZ outage | yearly | minutes (auto) |
| Region outage | every 5 years | hours (manual) |
| Ransomware / accidental delete | rare | hours to days |
Backup strategy
Layers
flowchart TD
L1[Layer 1: nightly pg_basebackup + continuous WAL]
L1R[retention: 7 days, local]
L1P[recovery point: any second within the last 7 days]
L1 --> L1R --> L1P
L2[Layer 2: pgBackRest weekly full + daily diff + WAL]
L2R[S3 / GCS cross-region]
L2RT[retention: 90 days]
L2P[recovery point: any second within the last 90 days]
L2 --> L2R --> L2RT --> L2P
L3[Layer 3: monthly logical dumps with pg_dump]
L3R[S3 immutable bucket - compliance]
L3RT[retention: 7 years]
L3P[recovery point: monthly snapshot]
L3 --> L3R --> L3RT --> L3P
Configuration (pgBackRest)
# /etc/pgbackrest/pgbackrest.conf
[global]
repo1-type=s3
repo1-s3-bucket=wf-backups-prod
repo1-s3-endpoint=s3.amazonaws.com
repo1-s3-region=us-west-2
repo1-s3-key=...
repo1-s3-key-secret=...
repo1-cipher-type=aes-256-cbc
repo1-cipher-pass=<random>
# Keep 4 weekly full backups
repo1-retention-full=4
# Keep 14 daily differential backups
repo1-retention-diff=14
[main]
pg1-path=/var/lib/postgresql/16/main
pg1-port=5432
Cron jobs:
# Sunday: full backup
0 2 * * 0 postgres pgbackrest --stanza=main --type=full backup
# Mon-Sat: differential
0 2 * * 1-6 postgres pgbackrest --stanza=main --type=diff backup
# WAL push (about every 15 min): no cron entry, configured via archive_command
Backup test (mandatory, monthly)
A backup that is not tested does not exist.
# Restore test in a sandbox
pgbackrest --stanza=main --type=time --target='2026-05-14 12:00:00' restore
# Verify consistency: rows written shortly before the target time must be present
psql -c "SELECT COUNT(*) FROM process_instances WHERE created_at > '2026-05-14 11:55:00'"
Scenario 1: Engine node crash (M2+)
Symptoms: one engine node is unresponsive; the others are healthy.
Detection:
- Healthcheck /readyz fails.
- Metric wf_engine_node_up == 0.
- Alert: up{job="wf-engine"} == 0.
Recovery (automatic):
1. K8s detects the failed liveness probe and restarts or reschedules the pod.
2. The load balancer removes the node from the pool.
3. The new pod starts, reconnects to Postgres, and receives traffic.
4. Expected RTO: 30-60s.
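To compare the observed recovery time against the expected RTO during a drill, a small polling helper can time how long a readiness probe keeps failing. This is a minimal sketch; the function name is hypothetical and the probe command is passed in by the caller.

```shell
#!/bin/sh
# Poll a probe command once per second until it succeeds, then print the
# elapsed seconds. Exits with status 1 if the probe has not succeeded once
# the timeout has elapsed. The body runs in a subshell, so "exit" only ends
# the function, not the calling shell.
measure_recovery_seconds() (
    timeout_s=$1
    shift
    start=$(date +%s)
    while :; do
        if "$@" >/dev/null 2>&1; then
            echo $(( $(date +%s) - start ))
            exit 0
        fi
        [ $(( $(date +%s) - start )) -ge "$timeout_s" ] && exit 1
        sleep 1
    done
)

# Usage during a drill (not executed here):
#   measure_recovery_seconds 120 curl -fsS http://wf-engine:8080/readyz
```

The resolution is one second, which is adequate for targets in the 30-60s range.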
Manual runbook (if automatic recovery fails):
# 1. Check state
kubectl get pods -n wf-engine
kubectl logs <failing-pod> -n wf-engine --tail=100
# 2. If the pod is in CrashLoopBackOff:
kubectl describe pod <failing-pod> -n wf-engine
# - OOMKilled? → raise the memory limit
# - DB connection refused? → check Postgres status
# - Migration mismatch? → roll back to the previous version
# 3. Force a restart if needed
kubectl delete pod <failing-pod> -n wf-engine
Scenario 2: Postgres primary failure (M2+)
Symptoms: writes fail; replica leader election is in progress.
Detection:
- Patroni emits a failover_initiated event.
- Metric pg_replication_lag_bytes spikes.
- Alert: patroni_master_count != 1.
Recovery (automatic with Patroni):
1. Patroni detects that the primary is down (~10s timeout).
2. It elects a new primary (the replica with the lowest lag).
3. It promotes that replica to primary.
4. The other replicas repoint to the new primary.
5. HAProxy / pgbouncer repoints clients.
6. The engine reopens connections (driver auto-reconnect).
7. RTO: 30-60s.
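Patroni makes the candidate choice internally. When a failover has to be triggered by hand, the same "lowest lag wins" rule can be applied to lag figures collected from the replicas. The sketch below assumes a hypothetical input format of one "member lag_bytes" pair per line.

```shell
#!/bin/sh
# Read "member lag_bytes" pairs on stdin and print the member with the
# lowest lag. The input format is an assumption for illustration only.
pick_lowest_lag() {
    sort -k2,2n | awk 'NR == 1 { print $1 }'
}

# Example with made-up members and lag values:
printf 'pg-2 2048\npg-3 512\npg-4 4096\n' | pick_lowest_lag
```

The printed member name is what would be passed as the candidate in a manual failover.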
Manual runbook:
# 1. Confirm state with Patroni
patronictl -c /etc/patroni/patroni.yml list
# Output: identify the current leader
# 2. If automatic failover did not happen, promote a replica
# (with the primary down there is no leader to name, so only the candidate is given)
patronictl -c /etc/patroni/patroni.yml failover --candidate <new>
# 3. Verify that the engine reconnects
curl http://wf-engine:8080/healthz
# Expect "ok" once it has reconnected
# 4. Investigate the old primary
# - Disk full? df -h
# - OOM? dmesg | tail
# - Network partition? ip a
# Once repaired, rejoin it as a replica:
patronictl -c /etc/patroni/patroni.yml reinit <cluster-name> <old-name>
Scenario 3: Logical corruption (data inconsistency)
Symptoms: queries return unexpected results; foreign key violations.
Detection:
- Constraint violations in logs.
- Metric wf_engine_consistency_errors_total > 0.
- Replay determinism check fails.
Diagnosis:
-- Integrity check
SELECT COUNT(*) FROM jobs WHERE process_instance_key NOT IN (SELECT key FROM process_instances);
-- If > 0: orphan jobs
SELECT COUNT(*) FROM element_instances WHERE process_instance_key IS NULL;
-- If > 0: orphan elements
-- Command log hash check
SELECT md5(string_agg(command_data::text, '' ORDER BY position))
FROM commands
WHERE position BETWEEN <start> AND <end>;
-- Compare with the expected hash
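The expected hash has to come from somewhere outside the suspect database, for example an exported copy of the command log. The sketch below computes the same digest as the query above (MD5 over the payloads concatenated in position order) from tab-separated "position, command_data" lines. It assumes GNU md5sum, UTF-8 text, and payloads that contain no tabs or newlines; the function name is hypothetical.

```shell
#!/bin/sh
# Read "position<TAB>command_data" lines on stdin, order them numerically by
# position, concatenate the payloads with no separator, and print the MD5
# hex digest of the result.
command_log_md5() {
    LC_ALL=C sort -t "$(printf '\t')" -k1,1n \
        | cut -f2- \
        | tr -d '\n' \
        | md5sum \
        | cut -d' ' -f1
}

# Example: three out-of-order entries whose payloads concatenate to "abc".
printf '2\tb\n1\ta\n3\tc\n' | command_log_md5
```

An export in this shape could be produced with psql in unaligned, tuples-only mode and a tab field separator; any difference between the two digests points at the position range to inspect.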
Recovery:
# Option A: Point-in-time restore (RPO depends on how far back the target time is)
pgbackrest --stanza=main --type=time --target='2026-05-14 11:30:00' restore
# Option B: Rebuild state from the command log
wf replay start --from-position 0 --to-position <current>
wf replay verify # check that hashes match
# Option C: Manual repair (last resort), run in psql
BEGIN;
DELETE FROM jobs WHERE process_instance_key NOT IN (SELECT key FROM process_instances);
-- Check the reported row count against the diagnosis query before committing
COMMIT;
Scenario 4: Disk full
Symptoms: writes fail with ENOSPC; the engine starts rejecting commands.
Detection:
- pg_database_size > threshold.
- node_filesystem_avail_bytes below 10% of node_filesystem_size_bytes.
- Alert: wf_db_disk_usage_percent > 85.
Immediate recovery:
# 1. Check
df -h /var/lib/postgresql
psql -c "SELECT pg_size_pretty(pg_database_size('wfengine'))"
# 2. Quick wins
# VACUUM FULL if bloat is high
# CAUTION: takes an exclusive lock and writes a full copy of the table, so it needs free space to run
psql -c "VACUUM FULL command_log"
# 3. Drop old partitions
psql -c "DROP TABLE commands_2025_01"
# 4. Accumulated WAL files?
ls /var/lib/postgresql/16/main/pg_wal | wc -l
# If there are thousands: is a replica or a backup stuck?
# 5. If nothing else works: expand storage
# k8s: kubectl edit pvc data-postgres-0 → storage: +500Gi
# AWS: aws rds modify-db-instance --allocated-storage ...
Prevention:
- Alert at 70% for non-urgent action.
- Alert at 85% for immediate action.
- Alert at 95% pages on-call.
- Auto-grow the PVC (k8s) with headroom.
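The alert tiers above map directly onto a threshold check. A minimal sketch, with tier labels chosen here for illustration:

```shell
#!/bin/sh
# Map an integer disk usage percentage to the action tiers listed above.
# The tier labels are illustrative, not names used by any alerting system.
disk_alert_level() (
    used_pct=$1
    if   [ "$used_pct" -ge 95 ]; then echo "page"
    elif [ "$used_pct" -ge 85 ]; then echo "immediate"
    elif [ "$used_pct" -ge 70 ]; then echo "non-urgent"
    else echo "ok"
    fi
)

# Usage with GNU df (assumption; not executed here):
#   disk_alert_level "$(df --output=pcent /var/lib/postgresql | tail -n 1 | tr -dc '0-9')"
```

Checking the highest tier first keeps each threshold inclusive: exactly 85% already counts as immediate action.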
Scenario 5: AZ outage (M3+)
Symptoms: one entire AZ is inaccessible.
Detection:
- Multiple nodes in the same AZ are down.
- Cloud provider status page.
Recovery (automatic if configured):
1. K8s reschedules pods to other AZs.
2. Postgres fails over (if the primary was in the failed AZ).
3. Load balancer health checks remove the affected nodes.
4. RTO: 60s.
Validation:
# Confirm capacity in the other AZs
kubectl get nodes -l 'topology.kubernetes.io/zone in (us-west-2a,us-west-2b)'
# Confirm the Postgres primary is in another AZ
patronictl -c /etc/patroni/patroni.yml list
# Confirm throughput has recovered
curl -sG http://prometheus/api/v1/query --data-urlencode 'query=rate(wf_engine_commands_processed_total[5m])'
Scenario 6: Region outage (M4+)
Symptoms: the entire region is unreachable.
Detection:
- Multi-region monitoring (DNS-based).
- Provider status page.
Recovery (manual, RTO 10 min):
# 1. Promote the secondary region to primary
# - Check lag
psql -h secondary-region -c "SELECT pg_last_wal_replay_lsn(), pg_last_wal_receive_lsn()"
# If the lag is acceptable, promote
psql -h secondary-region -c "SELECT pg_promote()"
# 2. Update DNS / GLB
# Route 53 / Cloudflare: update the health check
aws route53 change-resource-record-sets ...
# 3. Point workers at the new region
# Workers repoint through service discovery
kubectl rollout restart deploy/workers -n wf-workers
# 4. Notify users
# Status page update, customer comms
# 5. When the original primary region recovers:
# - Rejoin it as the secondary (no automatic failback)
# - Verify consistency with replay verify
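Step 1 returns two LSNs, and "acceptable lag" is easier to judge as a byte count. An LSN such as 16/B374D848 is a 64-bit position written as two hexadecimal halves, so the gap can be computed with shell arithmetic. The function names below are hypothetical; PostgreSQL's own pg_wal_lsn_diff() gives the same number in SQL.

```shell
#!/bin/sh
# Convert a PostgreSQL LSN such as "16/B374D848" to an absolute byte
# position: the part before the slash is the high 32 bits, the part after
# it is the low 32 bits.
lsn_to_bytes() (
    hi=${1%%/*}
    lo=${1#*/}
    echo $(( (0x$hi << 32) + 0x$lo ))
)

# Bytes received by the standby but not yet replayed.
# Arguments: <receive_lsn> <replay_lsn>
lsn_lag_bytes() (
    received=$(lsn_to_bytes "$1")
    replayed=$(lsn_to_bytes "$2")
    echo $(( received - replayed ))
)

# Example with made-up LSNs:
lsn_lag_bytes 0/3000060 0/3000000
```

With the primary region unreachable this only measures WAL that arrived but has not been replayed; whatever the primary had not yet shipped cannot be measured from the standby.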
Scenario 7: Ransomware / accidental DELETE
Symptoms: data missing or encrypted.
Detection:
- Audit log: unexpected actor.
- Metrics: row count dropped.
- Application errors: "not found".
Recovery:
# 1. ISOLATE: cut access to the compromised system
kubectl scale deploy/wf-engine -n wf-engine --replicas=0
# 2. PRESERVE: snapshot the current state BEFORE making any change (for forensics)
# Note: a new full backup counts toward repo1-retention-full; make sure expiry cannot remove the last clean backup
pgbackrest --stanza=main backup --type=full
# 3. RESTORE from the last clean (pre-incident) backup
pgbackrest --stanza=main --type=time --target='2026-05-13 08:00:00' restore
# (The target must be earlier than the attack)
# 4. VERIFY
psql -c "SELECT COUNT(*) FROM process_instances WHERE created_at > '2026-05-13 08:00:00'"
# How many were lost: this counts instances created after the restore point, so run it against the state preserved in step 2 (the restored database returns 0)
# 5. REBUILD lost data where possible
# - Re-ingest from upstream events
# - Customer comms for the lost processes
# 6. POSTMORTEM
# - Root cause
# - Audit trail
# - Hardening
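Step 3 needs a target time that is safely earlier than the first malicious or accidental change. One way to derive it is to subtract a safety margin from the earliest suspicious event in the audit log. The sketch below assumes GNU date and a hypothetical helper name; the incident time is given in epoch seconds and the result is printed in UTC with an explicit offset so the target is unambiguous.

```shell
#!/bin/sh
# Print a UTC timestamp that lies margin_s seconds before incident_epoch,
# formatted as 'YYYY-MM-DD HH:MM:SS+00'. Assumes GNU date (-d "@<epoch>").
restore_target_before() (
    incident_epoch=$1
    margin_s=$2
    date -u -d "@$(( incident_epoch - margin_s ))" '+%Y-%m-%d %H:%M:%S+00'
)

# Example: 10 minutes before epoch second 3600.
restore_target_before 3600 600
```

The output can be passed as the --target value of the restore command. A larger margin loses more data but lowers the risk of restoring state that is already compromised.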
Prevention:
- Immutable backups (S3 Object Lock).
- Audit log with tamper detection.
- Least-privilege RBAC.
- Network segmentation.
- Backup encryption keys kept separate from the engine.
Drills (chaos engineering)
Discipline: run drills periodically to validate the runbooks.
Quarterly drill schedule
Q1: Pod crash + auto-recovery (M2+)
Q2: Postgres failover (M2+)
Q3: AZ outage simulation (M3+)
Q4: Full region failover (M4+)
Annual: Backup restore from S3 (cross-region)
Chaos tools
- chaos-mesh or Litmus: k8s-native chaos.
- AWS Fault Injection Simulator: AZ outage simulation.
- Patroni manual failover: Postgres drill.
# chaos-mesh: make up to 30% of engine pods unavailable for 1 minute
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: pod-failure-test
spec:
  action: pod-failure
  mode: random-max-percent
  value: "30"
  duration: "1m"
  selector:
    labelSelectors:
      app: wf-engine
Drill report template
## Drill: <name> — <date>
### Setup
- Cluster: prod-east-1
- Version: v1.2.3
- Date: 2026-05-14
- Conductor: paulo
### Scenario
<description>
### Timeline
- T+0: trigger applied
- T+12s: alert fired (PagerDuty)
- T+25s: on-call ack
- T+90s: service recovered
### Metrics
- RTO actual: 90s (target 60s) ⚠️
- RPO actual: 0 (target 30s) ✅
- p99 latency during incident: 450ms ⚠️
### Findings
- Alert delay ~12s (can be improved)
- Engine reconnect took 30s (expected <10s)
### Actions
- [ ] Tune alert threshold
- [ ] Investigate engine reconnect logic
Observability for DR
wf_dr_backup_age_seconds # time since the last backup
wf_dr_backup_size_bytes
wf_dr_wal_lag_bytes # primary vs replicas
wf_dr_replication_lag_seconds
wf_dr_failover_total{from, to}
wf_dr_restore_duration_seconds{type}
Critical alerts:
- wf_dr_backup_age_seconds > 86400 → backup stuck.
- wf_dr_replication_lag_seconds > 30 → replica falling behind.
- wf_dr_wal_lag_bytes > 10GB → the replica may be unable to catch up.
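The three thresholds can be checked together, for instance in a drill script that reads the current metric values. This is a sketch only: the function name is hypothetical, inputs are plain integers, and 10GB is read as 10 GiB (10737418240 bytes), which is an assumption.

```shell
#!/bin/sh
# Print one line per breached critical DR threshold.
# Arguments: <backup_age_seconds> <replication_lag_seconds> <wal_lag_bytes>
# 10GB is interpreted as 10 GiB here (assumption).
dr_alerts() (
    backup_age_s=$1
    replication_lag_s=$2
    wal_lag_bytes=$3
    [ "$backup_age_s" -gt 86400 ] && echo "backup stuck"
    [ "$replication_lag_s" -gt 30 ] && echo "replica falling behind"
    [ "$wal_lag_bytes" -gt 10737418240 ] && echo "replica may be unable to catch up"
    exit 0
)

# Example: a backup older than one day, healthy replication.
dr_alerts 90000 5 0
```

In production these conditions belong in the alerting system; a script like this is mainly useful for checking that a drill actually pushed a metric past its threshold.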
Audit / compliance
DR procedures must be documented for SOC 2 / ISO 27001 audits:
- Runbooks versioned in git.
- Drill reports archived.
- Backup verification reports.
- Backup access logs.
See analysis/compliance-roadmap.
Post-incident checklist
- Service restored to healthy state.
- Customer comms sent.
- Post-mortem scheduled (within 48h).
- Root cause documented.
- Action items tracked.
- Runbook updated if gaps found.
- Monitoring/alerting updated.
- Drill scheduled to verify fix.
References
- analysis/scaling-strategy-postgres: phases that affect RPO/RTO
- analysis/failure-mode-analysis: FMEA
- adrs/adr-020-patroni-postgres-ha: HA setup
- concepts/postgres-monitoring: DB metrics
- pgBackRest docs
- Patroni docs
- SRE Book - Managing Incidents