Disaster recovery: operational runbook

Procedures for disaster recovery: database loss, region outage, command log corruption, and point-in-time restore. Includes RPO/RTO targets per phase and executable runbooks.

Recovery targets

| Phase | RPO (Recovery Point Objective) | RTO (Recovery Time Objective) | Availability |
| --- | --- | --- | --- |
| M1 (single-node) | 24h (daily backup) | 4h (manual restore) | 99% |
| M2 (Patroni HA) | 5 min (WAL streaming) | 30s (auto-failover) | 99.9% |
| M3 (multi-AZ) | 30s (sync replication) | 60s (auto-failover) | 99.95% |
| M4 (multi-region) | 5 min (async cross-region) | 10 min (manual region failover) | 99.99% |

Definitions:
  • RPO: how much data you can afford to lose. An RPO of 5 min means writes from the last 5 minutes before the failure may be lost.
  • RTO: how long the service takes to come back. An RTO of 30s means the service is back within 30 seconds.
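The availability column can be read as a yearly downtime budget, which is a useful sanity check against the RTO column. A minimal sketch, assuming a 365-day year; the function name `downtime_budget_minutes` is illustrative and not part of any tooling described in this runbook.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumption: non-leap year


def downtime_budget_minutes(availability_pct: float) -> float:
    """Return the downtime allowed per year, in minutes, for a given availability."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability must be a percentage between 0 and 100")
    return MINUTES_PER_YEAR * (1 - availability_pct / 100)


if __name__ == "__main__":
    # Availability targets taken from the table above
    for phase, pct in [("M1", 99.0), ("M2", 99.9), ("M3", 99.95), ("M4", 99.99)]:
        print(f"{phase}: {downtime_budget_minutes(pct):.1f} min/year")
```

Under that assumption, 99.99% leaves roughly 52.6 minutes per year, so a single 10-minute manual region failover consumes about a fifth of the M4 budget.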

Failure inventory

| Category | Expected frequency | Impact |
| --- | --- | --- |
| Single node crash (engine) | monthly | 30s (HA) |
| Single node crash (Postgres replica) | monthly | 0 (replica) |
| Postgres primary crash | twice a year | 30s failover |
| Disk full | quarterly | minutes (manual) |
| Data corruption (logical) | yearly | hours |
| AZ outage | yearly | minutes (auto) |
| Region outage | every 5 years | hours (manual) |
| Ransomware / accidental delete | rare | hours to days |

Backup strategy

Layers

flowchart TD
    L1[Layer 1: pg_basebackup nightly + WAL continuous]
    L1R[retention 7 days local]
    L1P[recovery point: any second within the last 7 days]
    L1 --> L1R --> L1P

    L2[Layer 2: pgBackRest weekly full + diff daily + WAL]
    L2R[S3 / GCS cross-region]
    L2RT[retention 90 days]
    L2P[recovery point: any second within the last 90 days]
    L2 --> L2R --> L2RT --> L2P

    L3[Layer 3: Logical dumps pg_dump monthly]
    L3R[S3 immutable bucket - compliance]
    L3RT[retention 7 years]
    L3P[recovery point: monthly snapshot]
    L3 --> L3R --> L3RT --> L3P

Configuration (pgBackRest)

# /etc/pgbackrest/pgbackrest.conf
[global]
repo1-type=s3
repo1-s3-bucket=wf-backups-prod
repo1-s3-endpoint=s3.amazonaws.com
repo1-s3-region=us-west-2
repo1-s3-key=...
repo1-s3-key-secret=...
repo1-cipher-type=aes-256-cbc
repo1-cipher-pass=<random>
# Keep 4 weekly full backups
repo1-retention-full=4
# Keep 14 daily differentials
repo1-retention-diff=14

[main]
pg1-path=/var/lib/postgresql/16/main
pg1-port=5432
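
The retention counts above determine how far back a point-in-time restore can reach. A rough sketch of that arithmetic, assuming retention is counted in backups (pgBackRest's default `repo1-retention-full-type=count`), that full backups run on a fixed interval, and that WAL is kept back to the oldest retained full; the helper names are illustrative.

```python
import math


def guaranteed_window_days(retained_fulls: int, interval_days: int) -> int:
    """Days of history always restorable, measured right after the oldest full expires."""
    if retained_fulls < 1:
        raise ValueError("at least one full backup must be retained")
    return (retained_fulls - 1) * interval_days


def fulls_needed(window_days: int, interval_days: int) -> int:
    """Smallest retention count whose guaranteed window covers window_days."""
    return math.ceil(window_days / interval_days) + 1


if __name__ == "__main__":
    print(guaranteed_window_days(4, 7))  # weekly fulls, retention 4
    print(fulls_needed(90, 7))           # fulls required for a 90-day window
```

With these assumptions, `repo1-retention-full=4` on a weekly schedule guarantees about 21 days of history, while a 90-day window would need 14 weekly fulls.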

Cronjobs:

# Sunday: full backup
0 2 * * 0 postgres pgbackrest --stanza=main --type=full backup

# Mon-Sat: differential
0 2 * * 1-6 postgres pgbackrest --stanza=main --type=diff backup

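# WAL push: no cron entry needed. PostgreSQL calls archive_command for each
# completed WAL segment; set archive_timeout=900 if a 15-minute upper bound is wanted.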

Backup test (mandatory, monthly)

An untested backup does not exist.

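# Restore test in a sandbox (PostgreSQL stopped; empty data directory or --delta)
pgbackrest --stanza=main --type=time --target='2026-05-14 12:00:00' restore
# Start PostgreSQL and let recovery reach the target, then verify consistency
psql -c "SELECT COUNT(*) FROM process_instances WHERE created_at > '2026-05-14 11:55:00'"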

Scenario 1: Engine node crash (M2+)

Symptoms: one engine node is unresponsive; the others are healthy.

Detection:
  • Healthcheck /readyz fails.
  • Metric wf_engine_node_up == 0.
  • Alert: up{job="wf-engine"} == 0.

Recovery (auto):
  1. Kubernetes detects the failed liveness probe and restarts the pod.
  2. The load balancer removes the node from the pool.
  3. The new pod starts, reconnects to Postgres, and receives traffic.
  4. Expected RTO: 30-60s.

Manual runbook (if automatic recovery fails):

# 1. Check state
kubectl get pods -n wf-engine
kubectl logs <failing-pod> -n wf-engine --tail=100

# 2. If the pod is in CrashLoop:
kubectl describe pod <failing-pod> -n wf-engine
# - OOMKilled? → raise the memory limit
# - DB connection refused? → check Postgres status
# - Migration mismatch? → roll back to the previous version

# 3. Force a restart if necessary
kubectl delete pod <failing-pod> -n wf-engine
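
The triage in step 2 can be partly automated by reading the pod status from `kubectl get pod <failing-pod> -n wf-engine -o json`. A minimal sketch, assuming that JSON has already been parsed into a dict; `triage_pod` and its hint strings are illustrative, and only the CrashLoopBackOff and OOMKilled reasons are handled.

```python
from typing import List


def triage_pod(pod: dict) -> List[str]:
    """Return human-readable hints based on a pod's containerStatuses."""
    hints = []
    for cs in pod.get("status", {}).get("containerStatuses", []):
        name = cs.get("name", "?")
        # state.waiting is set while the container is waiting to be restarted
        waiting = cs.get("state", {}).get("waiting") or {}
        # lastState.terminated describes how the previous run ended
        last = cs.get("lastState", {}).get("terminated") or {}
        if waiting.get("reason") == "CrashLoopBackOff":
            restarts = cs.get("restartCount", 0)
            hints.append(f"{name}: in CrashLoopBackOff ({restarts} restarts)")
        if last.get("reason") == "OOMKilled":
            hints.append(f"{name}: last run was OOM-killed, consider raising the memory limit")
    return hints
```

Connection and migration failures do not show up in these fields, so the logs from step 1 are still needed for those cases.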

Scenario 2: Postgres primary failure (M2+)

Symptoms: writes fail; leader election among the replicas is in progress.

Detection:
  • Patroni emits the failover_initiated event.
  • Metric pg_replication_lag_bytes spikes.
  • Alert: patroni_master_count != 1.

Recovery (auto with Patroni):
  1. Patroni detects that the primary is down (~10s timeout).
  2. It elects a new primary (the replica with the least lag).
  3. It promotes that replica to primary.
  4. The other replicas re-point to the new primary.
  5. HAProxy / pgbouncer re-points clients.
  6. The engine reopens connections (driver auto-reconnect).
  7. RTO: 30-60s.

Manual runbook:

# 1. Confirm state with Patroni
patronictl -c /etc/patroni/patroni.yml list
# Output: check the current leader

# 2. If there is no automatic failover:
patronictl -c /etc/patroni/patroni.yml failover --master <old> --candidate <new>

# 3. Verify the engine reconnects
curl http://wf-engine:8080/healthz
# Expect "ok" after reconnect

# 4. Investigate the old primary
# - Disk full? df -h
# - OOM? dmesg | tail
# - Network partition? ip a
# Once repaired, rejoin it as a replica:
patronictl -c /etc/patroni/patroni.yml reinit <cluster-name> <old-name>
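
Step 3 amounts to polling the health endpoint until it answers. A minimal sketch, assuming `/healthz` returns HTTP 200 with a body containing `ok`, as the comment in the runbook suggests; the timeout and interval defaults are arbitrary, and the probe is injectable so the loop can be exercised without a live engine.

```python
import time
import urllib.request
from typing import Callable


def http_probe(url: str = "http://wf-engine:8080/healthz") -> bool:
    """Return True when the endpoint answers 200 and the body contains 'ok'."""
    try:
        with urllib.request.urlopen(url, timeout=2) as resp:
            return resp.status == 200 and b"ok" in resp.read()
    except OSError:  # covers URLError, timeouts, and connection refusals
        return False


def wait_until_healthy(probe: Callable[[], bool] = http_probe,
                       timeout_s: float = 60.0,
                       interval_s: float = 2.0,
                       sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll probe until it succeeds or timeout_s elapses; return the final result."""
    deadline = time.monotonic() + timeout_s
    while True:
        if probe():
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)
```

Calling `wait_until_healthy()` with the defaults gives the engine 60 seconds to come back, which matches the upper end of the RTO stated for this scenario.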

Scenario 3: Logical corruption (data inconsistency)

Symptoms: queries return unexpected results; foreign key violations appear.

Detection:
  • Constraint violations in the logs.
  • Metric wf_engine_consistency_errors_total > 0.
  • Replay determinism check fails.

Diagnosis:

-- Verify integrity
SELECT COUNT(*) FROM jobs WHERE process_instance_key NOT IN (SELECT key FROM process_instances);
-- If > 0: orphan jobs

SELECT COUNT(*) FROM element_instances WHERE process_instance_key IS NULL;
-- If > 0: orphan elements

-- Hash check of the command log
SELECT md5(string_agg(command_data::text, '' ORDER BY position))
FROM commands
WHERE position BETWEEN <start> AND <end>;
-- Compare with the expected hash
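
The same hash can be computed outside the database, for example against an exported slice of the command log. A sketch of the equivalent computation, assuming each row is available as a `(position, text)` pair where the text is exactly what `command_data::text` yields in Postgres and the database encoding is UTF-8; `command_log_md5` is an illustrative name.

```python
import hashlib
from typing import Iterable, Tuple


def command_log_md5(rows: Iterable[Tuple[int, str]]) -> str:
    """MD5 over command payloads concatenated in position order, mirroring the SQL check."""
    ordered = sorted(rows, key=lambda row: row[0])
    joined = "".join(text for _, text in ordered)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()
```

MD5 serves here only as a change detector, not as a security control. One difference from the SQL version: for an empty range the query returns NULL, while this function returns the hash of the empty string.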

Recovery:

# Option A: Point-in-time restore (RPO depends on the chosen target time)
pgbackrest --stanza=main --type=time --target='2026-05-14 11:30:00' restore

# Option B: Rebuild state from the command log
wf replay start --from-position 0 --to-position <current>
wf replay verify  # check hashes match

# Option C: Manual repair (last resort)
psql -d wfengine <<'SQL'
BEGIN;
DELETE FROM jobs WHERE process_instance_key NOT IN (SELECT key FROM process_instances);
COMMIT;
SQL

Scenario 4: Disk full

Symptoms: writes fail with ENOSPC; the engine starts rejecting commands.

Detection:
  • pg_database_size > threshold.
  • node_filesystem_avail_bytes < 10% of the filesystem size.
  • Alert: wf_db_disk_usage_percent > 85.

Immediate recovery:

# 1. Check
df -h /var/lib/postgresql
psql -c "SELECT pg_size_pretty(pg_database_size('wfengine'))"

# 2. Quick wins
# Vacuum if bloat is high
# CAUTION: takes an exclusive lock and rewrites the table, so it needs
# free space for a full copy; on a nearly full disk prefer steps 3-5
psql -c "VACUUM FULL command_log"

# 3. Drop old partitions
psql -c "DROP TABLE commands_2025_01"  # oldest first

# 4. WAL files piling up?
ls /var/lib/postgresql/16/main/pg_wal | wc -l
# If thousands: replica stuck? backup or archiving stuck?

# 5. If nothing works: expand storage
# k8s: kubectl edit pvc data-postgres-0  →  storage: +500Gi
# AWS: aws rds modify-db-instance --allocated-storage ...

Prevention:
  • Alert at 70% for non-urgent action.
  • Alert at 85% for immediate action.
  • Alert at 95% pages on-call.
  • Auto-grow PVC (k8s) with headroom.
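
The three alert tiers map directly onto a small classifier. A minimal sketch, assuming the thresholds are inclusive and that usage is measured on the filesystem holding the Postgres data directory; the tier names `ticket`, `immediate`, and `page` are illustrative labels, not names used by any alerting system in this runbook.

```python
import shutil
from typing import Optional


def disk_alert_level(used_pct: float) -> Optional[str]:
    """Map disk usage to the alert tiers: 70 -> ticket, 85 -> immediate, 95 -> page."""
    if used_pct >= 95:
        return "page"
    if used_pct >= 85:
        return "immediate"
    if used_pct >= 70:
        return "ticket"
    return None


def current_usage_pct(path: str = "/var/lib/postgresql") -> float:
    """Percentage of the filesystem at path that is in use."""
    usage = shutil.disk_usage(path)
    return 100 * usage.used / usage.total
```

For example, `disk_alert_level(current_usage_pct())` returns `None` while the volume is below 70% and escalates as it fills.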

Scenario 5: AZ outage (M3+)

Symptoms: an entire AZ is unreachable.

Detection:
  • Multiple nodes in the same AZ are down.
  • Cloud provider status page.

Recovery (auto if configured):
  1. Kubernetes reschedules pods to other AZs.
  2. Postgres fails over (if the primary was in the failed AZ).
  3. Load balancer health checks remove the affected nodes.
  4. RTO: 60s.

Validation:

# Confirm capacity in the other AZs
kubectl get nodes -l 'topology.kubernetes.io/zone in (us-west-2a,us-west-2b)'

# Confirm the Postgres primary is in another AZ
patronictl -c /etc/patroni/patroni.yml list

# Confirm throughput has recovered
curl -G http://prometheus/api/v1/query --data-urlencode 'query=rate(wf_engine_commands_processed_total[5m])'

Scenario 6: Region outage (M4+)

Symptoms: the entire region is unreachable.

Detection:
  • Multi-region monitoring (DNS-based).
  • Provider status page.

Recovery (manual, RTO 10min):

# 1. Promote the secondary region to primary
# - Check lag
psql -h secondary-region -c "SELECT pg_last_wal_replay_lsn(), pg_last_wal_receive_lsn()"
# If the lag is acceptable, promote
psql -h secondary-region -c "SELECT pg_promote()"

# 2. Update DNS / GLB
# Route 53 / Cloudflare: update health check
aws route53 change-resource-record-sets ...

# 3. Point workers at the new region
# Workers re-point via service discovery
kubectl rollout restart deploy/workers -n wf-workers

# 4. Notify users
# Status page update, customer comms

# 5. When the primary region recovers:
# - Rejoin it as secondary (no automatic failback)
# - Verify consistency with replay verify
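
The two LSNs returned in step 1 are hexadecimal `high/low` pairs, and their difference is the amount of WAL the standby has received but not yet replayed. This measures the replay backlog on the secondary, not the distance from the unreachable primary. A minimal sketch of the conversion; the function names are illustrative, and inside Postgres the built-in `pg_wal_lsn_diff()` does the same subtraction.

```python
def lsn_to_bytes(lsn: str) -> int:
    """Convert a PostgreSQL LSN such as '16/B374D848' to an absolute byte position."""
    high, low = lsn.strip().split("/")
    return (int(high, 16) << 32) | int(low, 16)


def replay_backlog_bytes(receive_lsn: str, replay_lsn: str) -> int:
    """Bytes of WAL received by the standby but not yet replayed."""
    return lsn_to_bytes(receive_lsn) - lsn_to_bytes(replay_lsn)
```

A backlog of zero means the standby has applied everything it received, which is the state to aim for before calling `pg_promote()`.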

Scenario 7: Ransomware / accidental DELETE

Symptoms: data is missing or encrypted.

Detection:
  • Audit log: unexpected actor.
  • Metrics: row count dropped.
  • Application errors: "not found".

Recovery:

# 1. ISOLATE: cut access to the compromised system
kubectl scale deploy/wf-engine -n wf-engine --replicas=0

# 2. PRESERVE: snapshot the current state BEFORE any change
pgbackrest --stanza=main backup --type=full
# For forensics

# 3. RESTORE from the last clean (pre-incident) backup
pgbackrest --stanza=main --type=time --target='2026-05-13 08:00:00' restore
# (Before the time of the attack)

# 4. VERIFY
psql -c "SELECT COUNT(*) FROM process_instances WHERE created_at > '2026-05-13 08:00:00'"
# How many were lost (the restored database returns 0 here;
# compare against the forensic copy or upstream records)

# 5. REBUILD lost data if possible
# - Re-ingest from upstream events
# - Customer comms for the lost processes

# 6. POSTMORTEM
# - Root cause
# - Audit trail
# - Hardening

Prevention:
  • Immutable backups (S3 Object Lock).
  • Audit log with tamper detection.
  • Least-privilege RBAC.
  • Network segmentation.
  • Backup encryption keys kept separate from the engine.

Drills (chaos engineering)

Discipline: run drills periodically to validate the runbooks.

Quarterly drill schedule

Q1: Pod crash + auto-recovery (M2+)
Q2: Postgres failover (M2+)
Q3: AZ outage simulation (M3+)
Q4: Full region failover (M4+)
Annual: Backup restore from S3 (cross-region)

Chaos tools

  • chaos-mesh or Litmus: k8s-native chaos.
  • AWS Fault Injection Simulator: AZ outage simulation.
  • Patroni manual failover: drill for Postgres.

# chaos-mesh: make up to 30% of engine pods unavailable for 1 minute
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: pod-failure-test
spec:
  action: pod-failure
  mode: random-max-percent
  value: "30"
  duration: "1m"
  selector:
    labelSelectors:
      app: wf-engine

Drill report template

## Drill: <name> — <date>

### Setup
- Cluster: prod-east-1
- Version: v1.2.3
- Date: 2026-05-14
- Conductor: paulo

### Scenario
<description>

### Timeline
- T+0: trigger applied
- T+12s: alert fired (PagerDuty)
- T+25s: on-call ack
- T+90s: service recovered

### Metrics
- RTO actual: 90s (target 60s) ⚠️
- RPO actual: 0 (target 30s) ✅
- p99 latency during incident: 450ms ⚠️

### Findings
- Alert delay ~12s (can be improved)
- Engine reconnect took 30s (expected <10s)

### Actions
- [ ] Tune alert threshold
- [ ] Investigate engine reconnect logic
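
The Metrics section of the template compares measured values against targets, which is easy to compute mechanically when filling in a report. A minimal sketch, assuming a drill passes an objective when the measured value does not exceed the target; `DrillResult` and `evaluate` are illustrative names, not part of any existing tooling.

```python
from dataclasses import dataclass


@dataclass
class DrillResult:
    rto_actual_s: float
    rto_target_s: float
    rpo_actual_s: float
    rpo_target_s: float


def evaluate(result: DrillResult) -> dict:
    """Return pass/fail per objective; an objective passes when actual <= target."""
    return {
        "rto_ok": result.rto_actual_s <= result.rto_target_s,
        "rpo_ok": result.rpo_actual_s <= result.rpo_target_s,
    }
```

With the sample numbers from the template, `evaluate(DrillResult(90, 60, 0, 30))` flags the RTO as missed and the RPO as met.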

Observability for DR

wf_dr_backup_age_seconds  # time since the last backup
wf_dr_backup_size_bytes
wf_dr_wal_lag_bytes  # primary vs replicas
wf_dr_replication_lag_seconds
wf_dr_failover_total{from, to}
wf_dr_restore_duration_seconds{type}

Critical alerts:
  • wf_dr_backup_age_seconds > 86400 → backup stuck.
  • wf_dr_replication_lag_seconds > 30 → replica falling behind.
  • wf_dr_wal_lag_bytes > 10GB → catch-up impossible.
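
How the backup age metric is exported is not specified here. One way to derive it is from `pgbackrest info --output=json`. The sketch below assumes that output is a list of stanza objects, each with a `name` and a `backup` array whose entries carry `timestamp.stop` as epoch seconds; check this against the pgBackRest version in use. The function name is illustrative.

```python
import json
import time
from typing import Optional


def backup_age_seconds(info_json: str, stanza: str = "main",
                       now: Optional[float] = None) -> Optional[float]:
    """Seconds since the newest backup in the stanza finished, or None if there is none."""
    now = time.time() if now is None else now
    for entry in json.loads(info_json):
        if entry.get("name") != stanza:
            continue
        stops = [b["timestamp"]["stop"] for b in entry.get("backup", [])]
        return now - max(stops) if stops else None
    return None  # stanza not present in the output
```

A result above 86400, or `None`, corresponds to the "backup stuck" alert condition.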

Audit / compliance

DR procedures must be documented for SOC 2 / ISO 27001 audits:

  • Runbooks versioned in git.
  • Archived drill reports.
  • Backup verification reports.
  • Access logs for backups.

See analysis/compliance-roadmap.

Checklist post-incident

  • Service restored to healthy state.
  • Customer comms sent.
  • Post-mortem scheduled (within 48h).
  • Root cause documented.
  • Action items tracked.
  • Runbook updated if gaps found.
  • Monitoring/alerting updated.
  • Drill scheduled to verify fix.

References