ADR-004: Do NOT fork consensus protocols

Status: Accepted
Date: 2026-05-14
Tags: infrastructure, consensus, ha

Context and Problem Statement

Once the MVP eventually needs real HA (Phase 1+), it will require some consensus mechanism for leader election and/or replication. Camunda forked Atomix (Raft + SWIM) and documents the costs of doing so. Do we follow the same path, use an existing library, or avoid consensus entirely?

Decision Drivers

Camunda explicitly documented the cost of its fork (see atomix fork lessons)
The maintenance burden of consensus protocols is very high (it requires sustained expertise)
Existing libraries (hashicorp/raft, etcd/raft) are battle-tested
The Postgres ecosystem provides HA without custom consensus

Considered Options

No custom consensus protocols: use Postgres replication + Patroni
Existing library: hashicorp/raft or etcd/raft
External coordination: Consul/etcd for leader election
Fork Atomix (Camunda's path)
Implement Raft from scratch

Decision Outcome

Chosen option: no custom consensus protocols; use Postgres replication + Patroni for HA. If Phase 4+ needs distributed consensus beyond what Postgres offers, use an existing library (hashicorp/raft). NEVER fork one or implement one from scratch.

Positive Consequences

Zero consensus code to maintain
HA through Patroni, which is mature (about 10 years in production use)
Automatic failover in ~30 seconds
No rare expertise required (the Patroni docs are excellent)
Avoids the roughly 4 years of R&D that Camunda invested
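The ~30-second figure lines up with Patroni's leader-lock TTL. As a rough sketch, assuming Patroni's documented defaults (`ttl: 30`, `loop_wait: 10`, `retry_timeout: 10`) and its documented rule that `loop_wait + 2 * retry_timeout` should not exceed `ttl`, the timing budget can be sanity-checked as shown here. The class and method names are hypothetical, and treating `ttl` as the detection bound is a simplifying assumption, not a measurement.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PatroniTiming:
    """Timing settings from Patroni's dynamic configuration, in seconds."""

    ttl: int = 30            # lifetime of the leader lock in the DCS
    loop_wait: int = 10      # sleep between HA loop iterations
    retry_timeout: int = 10  # timeout for DCS and Postgres operation retries

    def is_consistent(self) -> bool:
        # Rule stated in the Patroni docs: loop_wait + 2 * retry_timeout <= ttl
        return self.loop_wait + 2 * self.retry_timeout <= self.ttl

    def detection_bound_seconds(self) -> int:
        # Assumption: a crashed leader is noticed, at the latest, when its
        # leader lock expires, so ttl bounds the detection time.
        return self.ttl


defaults = PatroniTiming()
print(defaults.is_consistent())            # True
print(defaults.detection_bound_seconds())  # 30

# Lowering ttl alone breaks the rule: 10 + 2 * 10 = 30 > 20
print(PatroniTiming(ttl=20).is_consistent())  # False
```

Shortening `ttl` would reduce detection time, but the other two settings have to shrink with it, and a tighter budget is more sensitive to transient DCS or network hiccups.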

Negative Consequences

Slower failover than pure Raft (~30 s with Patroni vs. <1 s with Raft)
Limited to a single-leader pattern (Postgres replication is leader-follower)
Multi-leader writes would require CockroachDB/Yugabyte (Phase 6)
Lag in asynchronous Postgres replication can cause data loss in a disaster (~5 s of WAL streaming lag)
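To make the data-loss risk concrete, here is a back-of-the-envelope sketch. It assumes asynchronous streaming replication, a primary that is lost outright, and a steady write rate. The function name and the 200 writes/s figure are illustrative assumptions. Only the ~5 s lag comes from this ADR.

```python
import math


def writes_at_risk(replication_lag_s: float, write_tps: float) -> int:
    """Upper-bound estimate of committed writes that have not reached the replica."""
    if replication_lag_s < 0 or write_tps < 0:
        raise ValueError("lag and throughput must be non-negative")
    return math.ceil(replication_lag_s * write_tps)


# ~5 s of lag (from this ADR) at a hypothetical 200 writes per second
print(writes_at_risk(5.0, 200.0))  # 1000
```

If that window is unacceptable, synchronous replication (Patroni exposes a `synchronous_mode` setting) narrows it at the cost of write latency.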

Pros and Cons of the Options

No custom consensus (Patroni)

Pros:
- Zero consensus code to maintain
- Patroni is mature and battle-tested
- Automatic failover
- Excellent docs
- Ecosystem-supported

Cons:
- Failover takes ~30 s (vs. <1 s with Raft)
- Limited to leader-follower

Existing library (hashicorp/raft, etcd/raft)

Pros:
- Battle-tested at scale (Consul, Kubernetes)
- Active maintenance
- Bug fixes come from upstream
- Documentation and community

Cons:
- Integration code is still required
- Library limitations
- Go-specific (mostly)

External coordination (Consul/etcd)

Pros:
- Outsources consensus completely
- Proven patterns

Cons:
- Additional operational dependency
- Latency from network round trips
- Multi-service coordination

Fork Atomix (Camunda's path)

Pros:
- Full control

Cons (all documented by Camunda in its official README):
- Build complexity ("breaking branches")
- Snapshot version instability
- Release coupling
- Test flakiness
- ~50% of the inherited code was removed as irrelevant
- All bug-fixing responsibility is internal
- Camunda invested ~4 years of R&D

Implement Raft from scratch

Pros:
- Learning
- Customizable

Cons:
- 1-2 years of development just for basic correctness
- Exhaustive (Jepsen-level) testing required
- Subtle bugs take years to surface
- Effectively reinvents existing solutions

The cost documented by Camunda

Verbatim quotes from the README of Camunda's atomix module:

"We always had the problem that when we fixed or changed something in atomix we needed to release a new version to use it in Zeebe. Sometimes it happens that we just released the newest version of atomix on the day we wanted to released Zeebe, which sometimes broke the build."

"We switched to using snapshot versions, which improved this a bit. But if we then changed something it could happen that we broke develop and other branches in the Zeebe Repo."

"It was not easy to develop and test, since if you did a change in atomix you needed to build this locally, build then Zeebe locally and then run tests or a benchmark."

These pain points are all but guaranteed if you fork a consensus protocol. Camunda accepted them because its volume justified the cost. The MVP's volume does not.

Triggers to reconsider

Only reconsider a fork if ALL of the following are true:
1. Volume exceeds 100K TPS sustained
2. P99 latency < 1 s is business-critical
3. Patroni failover (~30 s) is unacceptable
4. You have 2+ full-time engineers dedicated to consensus
5. The business can sustain that investment for 5+ years

In 99.9% of cases, none of these applies.
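The rule above is a strict conjunction: a single false condition keeps the decision in place. A minimal sketch of that check, with hypothetical field names and the thresholds taken from the list:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ForkTriggers:
    """Inputs for the five reconsideration triggers (field names are illustrative)."""

    sustained_tps: float
    sub_second_p99_is_critical: bool
    patroni_failover_unacceptable: bool
    dedicated_consensus_engineers: int
    sustainable_years: float

    def should_reconsider_fork(self) -> bool:
        # ALL five conditions must hold; any single False keeps this ADR in force.
        return all((
            self.sustained_tps > 100_000,
            self.sub_second_p99_is_critical,
            self.patroni_failover_unacceptable,
            self.dedicated_consensus_engineers >= 2,
            self.sustainable_years >= 5,
        ))


# Hypothetical MVP profile: high-ish volume, but no dedicated consensus team.
mvp = ForkTriggers(
    sustained_tps=2_000,
    sub_second_p99_is_critical=True,
    patroni_failover_unacceptable=False,
    dedicated_consensus_engineers=0,
    sustainable_years=2,
)
print(mvp.should_reconsider_fork())  # False
```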

Links

atomix fork lessons: detailed costs of Camunda's fork
adr 020 patroni postgres ha: HA solution using Patroni
raft consensus: Raft theory
swim membership protocol: SWIM theory
hashicorp/raft
etcd-io/raft
Patroni