ADR-020: Patroni para Postgres HA¶
- Status: Accepted
- Date: 2026-05-14
- Tags: infrastructure, ha, postgres, phase-1
Context and Problem Statement¶
Phase 0 del MVP es single-node Postgres. Phase 1 requiere HA para SLA > 99%. ¿Qué tool de HA usamos?
Decision Drivers¶
- Patroni es maduro (battle-tested en Consul, Vault, Stack Overflow, etc.)
- Failover automático en ~30s
- Streaming replication built-in de Postgres
- etcd/Consul backend para coordination
Considered Options¶
- Patroni — orchestration layer encima de Postgres
- Stolon — alternative orchestrator
- repmgr — replication manager (older)
- pg_auto_failover — Microsoft-developed
- Cloud managed (RDS Multi-AZ, Cloud SQL HA)
- Custom failover scripting
Decision Outcome¶
Chosen option: Patroni para self-hosted, cloud managed HA para SaaS deployments.
Positive Consequences¶
- Failover automático (no human intervention)
- Battle-tested at scale
- etcd como DCS (distributed config store) standard
- Excellent docs
- Active community
- CLI tooling maduro (
patronictl)
Negative Consequences¶
- Operational complexity (3+ Postgres nodes + etcd cluster)
- etcd requires its own ops
- Failover ~30s (slower than Raft theoretical <1s)
- Split-brain risks if etcd partitioned
Architecture¶
flowchart TD
PGB["PgBouncer<br/>(connection pool)"]
subgraph PGCluster[Postgres cluster]
PRIM["PG Primary<br/>(writes)"]
REP1["PG Replica 1<br/>(hot standby)"]
REP2["PG Replica 2<br/>(hot standby)"]
end
subgraph ETCD["etcd cluster (3 nodes)"]
E["• Leader election state<br/>• Cluster configuration"]
end
PGB -->|"writes → primary"| PRIM
PGB -->|"reads → any"| REP1
PGB -->|"reads → any"| REP2
PRIM <-->|repl| REP1
PRIM -->|repl| REP2
PRIM -.->|Patroni agent| ETCD
REP1 -.->|Patroni agent| ETCD
REP2 -.->|Patroni agent| ETCD
Configuration ejemplo¶
# /etc/patroni/patroni.yml
scope: workflow-engine
namespace: /service/
name: pg-node-1
restapi:
listen: 0.0.0.0:8008
connect_address: pg-node-1.internal:8008
etcd3:
hosts:
- etcd-1.internal:2379
- etcd-2.internal:2379
- etcd-3.internal:2379
bootstrap:
dcs:
ttl: 30
loop_wait: 10
retry_timeout: 10
maximum_lag_on_failover: 1048576 # 1MB max lag for failover candidate
postgresql:
use_pg_rewind: true
use_slots: true
parameters:
max_connections: 200
max_wal_size: 4GB
wal_level: replica
hot_standby: "on"
max_wal_senders: 10
max_replication_slots: 10
wal_keep_size: 1GB
archive_mode: "on"
archive_command: "pgbackrest --stanza=main archive-push %p"
initdb:
- encoding: UTF8
- data-checksums # critical for integrity detection
postgresql:
listen: 0.0.0.0:5432
connect_address: pg-node-1.internal:5432
data_dir: /var/lib/postgresql/15/main
bin_dir: /usr/lib/postgresql/15/bin
authentication:
replication:
username: replicator
password: ${REPLICATOR_PASSWORD}
superuser:
username: postgres
password: ${POSTGRES_PASSWORD}
Operational procedures¶
Verificar cluster¶
patronictl -c /etc/patroni/patroni.yml list
# Output:
# + Cluster: workflow-engine -+---------+--------+
# | Member | Host | Role | State |
# +-----------+------------------+---------+--------+
# | pg-node-1 | pg-node-1.internal | Leader | running|
# | pg-node-2 | pg-node-2.internal | Replica | running|
# | pg-node-3 | pg-node-3.internal | Replica | running|
# +-----------+------------------+---------+--------+
Manual failover (planned maintenance)¶
Automatic failover (unplanned)¶
Si primary dies: 1. Patroni en replicas detecta (TTL 30s) 2. Leader election via etcd 3. Highest-LSN replica promoted 4. PgBouncer reconnects to new primary 5. Application transparent (con retry logic)
Total time: ~30s.
App-side considerations¶
Connection string usa PgBouncer (no direct PG):
PgBouncer config:
[databases]
workflow = host=pg-primary.internal port=5432
[pgbouncer]
pool_mode = transaction
max_client_conn = 1000
default_pool_size = 25
pg-primary.internal es DNS que Patroni mantiene apuntando al primary actual.
Retry logic en app¶
async def execute_with_retry(query, max_retries=3):
for attempt in range(max_retries):
try:
return await pool.execute(query)
except (ConnectionError, OperationalError):
if attempt == max_retries - 1:
raise
await asyncio.sleep(2 ** attempt) # exponential backoff
Durante failover (~30s), retries cubren la transition.
Backup integration¶
Patroni + pgBackRest:
postgresql:
parameters:
archive_mode: "on"
archive_command: "pgbackrest --stanza=main archive-push %p"
pgBackRest config:
# /etc/pgbackrest/pgbackrest.conf
[global]
repo1-path=/var/lib/pgbackrest
repo1-retention-full=7
repo1-type=s3
repo1-s3-bucket=my-pg-backups
repo1-s3-region=us-east-1
repo1-cipher-type=aes-256-cbc
[main]
pg1-path=/var/lib/postgresql/15/main
Full backup daily + WAL streaming continuous = RPO ~5 minutes.
Failure modes¶
| Failure | Behavior | RTO | RPO |
|---|---|---|---|
| Primary dies | Patroni promotes replica | ~30s | ~5s (WAL lag) |
| Replica dies | No impact (still 1 replica + primary) | 0s | 0 |
| etcd quorum lost | Patroni read-only mode (safety) | Until restored | 0 |
| Network partition primary | Self-fence, replica promotes | ~30s | ~5s |
| Disk full primary | Patroni alerts, failover manual | manual | depends |
| Disk corruption | Detected by data-checksums | Restore from backup | depends |
Cuándo usar cloud managed en su lugar¶
Para SaaS deployments:
| Cloud | Service | Trade-offs |
|---|---|---|
| AWS | RDS Multi-AZ | Easy, expensive |
| GCP | Cloud SQL HA | Easy, expensive |
| Azure | Database for PostgreSQL Flexible Server | Easy, expensive |
Trade-offs: - Pro: zero ops, managed - Con: 2-3x cost vs self-hosted Patroni - Con: vendor lock-in - Con: less control (no custom Postgres extensions sometimes)
Recomendación: - MVP early stage: cloud managed (focus on product) - Scale up: self-hosted Patroni (cost savings significant)
Links¶
- adrs/adr-002-postgresql-as-state-store — Foundation
- adrs/adr-003-single-node-mvp-incremental-scaling — Phase 1 trigger
- analysis/scaling-strategy-postgres — Roadmap completo
- Patroni
- pgBackRest
- PgBouncer