ADR-004: Do NOT fork consensus protocols
- Status: Accepted
- Date: 2026-05-14
- Tags: infrastructure, consensus, ha
Context and Problem Statement
When the MVP eventually needs real HA (Phase 1+), it will require some consensus mechanism for leader election and/or replication. Camunda forked Atomix (Raft + SWIM) and has documented the costs. Do we follow their path, use an existing library, or avoid consensus entirely?
Decision Drivers
- Camunda explicitly documented the cost of its fork (see concepts/atomix-fork-lessons)
- The maintenance burden of consensus protocols is very high (it requires sustained expertise)
- Existing libraries (hashicorp/raft, etcd/raft) are battle-tested
- The Postgres ecosystem provides HA without custom consensus
Considered Options
- NO custom consensus protocols: use Postgres replication + Patroni
- Existing library: hashicorp/raft or etcd/raft
- External coordination: Consul/etcd for leader election
- Fork Atomix (Camunda's path)
- Implement Raft from scratch
Decision Outcome
Chosen option: NO custom consensus protocols; use Postgres replication + Patroni for HA. If Phase 4+ requires distributed consensus beyond Postgres, use an existing library (hashicorp/raft). NEVER fork or implement from scratch.
Positive Consequences
- Zero consensus code to maintain
- HA via Patroni, a mature tool (~10 years of production use)
- Automatic failover in ~30 seconds
- No rare expertise required (Patroni's documentation is excellent)
- Avoids the ~4 years of R&D that Camunda invested
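Automatic failover means the primary can move between nodes, so clients and health checks need a way to locate it. The following Python sketch shows one way to do that by polling Patroni's REST API. It assumes the API listens on port 8008 and that `GET /primary` answers HTTP 200 only on the node currently holding the leader lock; the host names and the `find_primary` helper are hypothetical and not part of this ADR.

```python
from typing import Callable, Iterable, Optional
from urllib.error import URLError
from urllib.request import urlopen


def http_probe(url: str, timeout: float = 2.0) -> bool:
    """Return True when the URL answers with HTTP 200."""
    try:
        with urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (URLError, OSError):
        # Non-2xx answers raise HTTPError (a URLError subclass);
        # unreachable or slow nodes raise URLError/OSError.
        return False


def find_primary(
    hosts: Iterable[str],
    probe: Callable[[str], bool] = http_probe,
    port: int = 8008,  # assumed Patroni REST API port
) -> Optional[str]:
    """Return the first host whose /primary endpoint reports success, else None."""
    for host in hosts:
        if probe(f"http://{host}:{port}/primary"):
            return host
    return None


if __name__ == "__main__":
    # Hypothetical node names; replace with the real cluster members.
    print(find_primary(["pg-1.internal", "pg-2.internal", "pg-3.internal"]))
```

The `probe` parameter is injectable so the lookup logic can be exercised without a live cluster. In practice a load balancer health check against the same endpoint may be preferable to client-side polling.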
Negative Consequences
- Slower failover than pure Raft (~30s with Patroni vs <1s with Raft)
- Limited to the single-leader pattern (Postgres replication is leader-follower)
- Multi-leader writes would require CockroachDB/Yugabyte (Phase 6)
- Asynchronous Postgres replication lag can cause data loss in a disaster (~5s of WAL streaming lag)
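To make these trade-offs concrete, here is a rough Python sketch that turns the figures above (~30s failover, ~5s replication lag) into a downtime and data-loss estimate. The write rate and the availability target are illustrative assumptions, not measurements from this project, and the function names are hypothetical.

```python
from dataclasses import dataclass

SECONDS_PER_YEAR = 365 * 24 * 3600


@dataclass(frozen=True)
class FailoverExposure:
    downtime_seconds: float  # time without a writable primary
    writes_at_risk: float    # writes not yet streamed to the replica


def estimate_exposure(
    failover_seconds: float,
    replication_lag_seconds: float,
    writes_per_second: float,
) -> FailoverExposure:
    """Worst-case exposure of a single failover under asynchronous replication."""
    return FailoverExposure(
        downtime_seconds=failover_seconds,
        writes_at_risk=replication_lag_seconds * writes_per_second,
    )


def failovers_within_budget(
    availability_target: float,
    failover_seconds: float,
    period_seconds: float = SECONDS_PER_YEAR,
) -> int:
    """How many failovers fit in the downtime budget implied by the target."""
    if failover_seconds <= 0:
        raise ValueError("failover_seconds must be positive")
    budget = (1 - availability_target) * period_seconds
    return int(budget // failover_seconds)


if __name__ == "__main__":
    # Assumed workload: 200 writes/s; 30s failover and 5s lag come from this ADR.
    print(estimate_exposure(30, 5, 200))
    # Assumed target: 99.9% yearly availability.
    print(failovers_within_budget(0.999, 30))
```

Under these assumptions, a 30-second failover consumes only a small part of a 99.9% yearly downtime budget, while the data-loss window scales directly with the write rate and the replication lag.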
Pros and Cons of the Options
NO custom consensus (Patroni)
Pros:
- Zero consensus code to maintain
- Patroni is mature and battle-tested
- Automatic failover
- Excellent docs
- Ecosystem-supported
Cons:
- Failover takes ~30s (vs <1s with Raft)
- Limited to leader-follower
Existing library (hashicorp/raft, etcd/raft)
Pros:
- Battle-tested at scale (Consul, Kubernetes)
- Active maintenance
- Bug fixes come from upstream
- Documentation and community
Cons:
- Integration code is still required
- Library limitations
- Go-specific (mostly)
External coordination (Consul/etcd)
Pros:
- Outsources consensus completely
- Proven patterns
Cons:
- Additional operational dependency
- Latency from network round trips
- Multi-service coordination
Fork Atomix (Camunda path)
Pros:
- Full control
Cons (all documented by Camunda in its official README):
- Build complexity ("breaking branches")
- Snapshot version instability
- Release coupling
- Test flakiness
- ~50% of the inherited code removed as no longer relevant
- All bug-fixing responsibility falls on the internal team
- Camunda invested ~4 years of R&D
Implement Raft from scratch
Pros:
- Learning
- Customizable
Cons:
- 1-2 years of development just for basic correctness
- Exhaustive (Jepsen-level) testing required
- Subtle bugs take years to surface
- Effectively re-inventing existing solutions
The cost documented by Camunda
Verbatim quotes from the README of Camunda's atomix module:
"We always had the problem that when we fixed or changed something in atomix we needed to release a new version to use it in Zeebe. Sometimes it happens that we just released the newest version of atomix on the day we wanted to released Zeebe, which sometimes broke the build."
"We switched to using snapshot versions, which improved this a bit. But if we then changed something it could happen that we broke develop and other branches in the Zeebe Repo."
"It was not easy to develop and test, since if you did a change in atomix you needed to build this locally, build then Zeebe locally and then run tests or a benchmark."
These pain points are guaranteed if you fork a consensus protocol. Camunda accepted them because its volume justified the cost. The MVP's volume does NOT.
Triggers for reconsidering
Reconsider a fork only if ALL of these are true:
1. Volume exceeds 100K TPS sustained
2. P99 latency < 1s is business-critical
3. Patroni failover (~30s) is unacceptable
4. You have 2+ full-time engineers dedicated to consensus
5. The business can sustain that investment for 5+ years
For 99.9% of cases, none of these applies.
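Because the rule is a strict conjunction, it can be written down as a small check for use in planning reviews. This is a minimal sketch: the `ForkTriggers` type, its field names, and the sample values are hypothetical, and only the thresholds come from the list above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ForkTriggers:
    sustained_tps: float                     # trigger 1
    p99_under_1s_is_business_critical: bool  # trigger 2
    patroni_failover_unacceptable: bool      # trigger 3
    dedicated_consensus_engineers: int       # trigger 4
    sustainable_years: float                 # trigger 5


def should_reconsider_fork(t: ForkTriggers) -> bool:
    """True only when every trigger holds; one miss keeps this ADR in force."""
    return all(
        [
            t.sustained_tps > 100_000,
            t.p99_under_1s_is_business_critical,
            t.patroni_failover_unacceptable,
            t.dedicated_consensus_engineers >= 2,
            t.sustainable_years >= 5,
        ]
    )


if __name__ == "__main__":
    # Hypothetical MVP-stage values: low volume and no dedicated consensus team.
    mvp = ForkTriggers(
        sustained_tps=500,
        p99_under_1s_is_business_critical=False,
        patroni_failover_unacceptable=False,
        dedicated_consensus_engineers=0,
        sustainable_years=2,
    )
    print(should_reconsider_fork(mvp))
```

Using `all` rather than a score mirrors the intent of the list: meeting four of five triggers is not enough to reopen the decision.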
Links
- concepts/atomix-fork-lessons: detailed costs of Camunda's fork
- adrs/adr-020-patroni-postgres-ha: HA solution using Patroni
- concepts/raft-consensus: Raft theory
- concepts/swim-membership-protocol: SWIM theory
- hashicorp/raft
- etcd-io/raft
- Patroni