ADR-007: At-least-once delivery, idempotent workers
- Status: Accepted
- Date: 2026-05-14
- Tags: core, semantics, workers
Context and Problem Statement
When the engine sends a job to a worker, it can follow one of three delivery-semantics models: at-most-once (a job can be lost), at-least-once (a job can be delivered more than once), or exactly-once (precise but costly). Which one do we use?
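One way to see the difference between the first two models is where the acknowledgement sits relative to the work. The sketch below is illustrative only: `deliver_at_most_once`, `deliver_at_least_once`, and `make_handler` are hypothetical names rather than engine APIs, and a worker crash is simulated with an exception.

```python
class Crash(Exception):
    """Simulated worker crash."""


def deliver_at_most_once(job, handler, acks):
    # Acknowledge first, then process: a crash loses the job.
    acks.append(job)
    try:
        handler(job)
    except Crash:
        pass  # the job was already acked, so it is never retried


def deliver_at_least_once(job, handler, acks, max_attempts=3):
    # Process first, acknowledge after success: a crash causes a redelivery.
    for attempt in range(1, max_attempts + 1):
        try:
            handler(job)
        except Crash:
            continue  # no ack was recorded, so deliver again
        acks.append(job)
        return attempt
    return None


def make_handler(processed, crash_point):
    """Build a handler that crashes on its first call, 'before' or 'after' the work."""
    calls = {"n": 0}

    def handler(job):
        calls["n"] += 1
        first_call = calls["n"] == 1
        if first_call and crash_point == "before":
            raise Crash()
        processed.append(job)  # the side effect
        if first_call and crash_point == "after":
            raise Crash()

    return handler


lost, lost_acks = [], []
deliver_at_most_once("job-1", make_handler(lost, "before"), lost_acks)

duplicated, dup_acks = [], []
attempts = deliver_at_least_once("job-1", make_handler(duplicated, "after"), dup_acks)
```

In the first run the job is acknowledged but never processed; in the second it is processed twice and acknowledged once, which is the case idempotent workers have to absorb.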
Decision Drivers
- Exactly-once is prohibitively complex (it requires distributed transactions; see the Two Generals' Problem)
- At-most-once breaks business workflows (data loss is unacceptable)
- Camunda uses at-least-once and has validated it in production (Intuit: 100 duplicates out of 25M processes = 0.0004%)
- Idempotency is the worker's responsibility (a well-understood pattern)
Considered Options
- At-least-once + idempotent workers (Camunda 8 pattern)
- At-most-once (the worker never receives duplicates, but jobs can be lost)
- Exactly-once (absolute precision)
- Exactly-once with the outbox pattern (transactional outbox)
Decision Outcome
Chosen option: At-least-once + idempotent workers, because:
- It is the pragmatic standard (Kafka, Pulsar, and AWS SQS all build on at-least-once delivery)
- Exactly-once delivery is theoretically impossible across systems (Two Generals' Problem)
- Worker idempotency is a well-documented pattern
- It is validated by Intuit's production deployment
- The engineering burden sits in the workers, not in the engine (better encapsulation)
Positive Consequences
- Simpler engine (no exactly-once complexity)
- Clean failure handling (always retry)
- Compatible with all message brokers and queues
- Standard pattern, so worker code is portable
- No distributed transactions
Negative Consequences
- Workers MUST be idempotent (the developer's responsibility)
- Duplicate processing is possible in edge cases
- Documentation and onboarding burden (developers must be educated)
- Some workers are naturally hard to make idempotent (e.g., sending email)
Pros and Cons of the Options
At-least-once + idempotent workers
Pros:
- Pragmatic standard
- Simple engine
- Compatible with the whole ecosystem
- Clean failure recovery
Cons:
- Responsibility falls on the workers
- Duplicates are possible (rare, but possible)
- Educational burden
At-most-once
Pros:
- No duplicates
- The engine is simple here too
Cons:
- Data loss is unacceptable for business-critical workflows
- Failures mean lost work
- Worse reliability
Exactly-once
Pros:
- "Perfect" semantics
Cons:
- Theoretically impossible across systems
- Requires distributed transactions
- Locking and coordination overhead
- In practice, "exactly-once" claims are at-least-once + idempotency under the hood
- Severe performance cost
Exactly-once with the outbox pattern
Pros:
- Closer to exactly-once within a single service
- Well-known pattern
Cons:
- Only works inside one service boundary
- Cross-service delivery is still at-least-once
- Adds complexity (outbox table, polling)
- Net effect is similar to at-least-once + idempotency
Implementation: worker contract
Documentation (developer onboarding)
# Worker Development Guide
## CRITICAL: Idempotency requirement
Your worker WILL receive duplicate jobs occasionally (network glitches,
worker crashes mid-completion, timeouts, etc.). This is NOT a bug.
Your worker MUST be idempotent. Three strategies:
### Strategy 1: Natural idempotency
Operations that produce the same result regardless of execution count:
- `confirmCustomer(id)`: confirming an already-confirmed customer is a no-op
- `setOrderStatus(id, "shipped")`: setting a status that is already set changes nothing
### Strategy 2: Business idempotency
Check before mutating, using business identifiers:
- Before creating an order, check whether an order with this `correlationId` exists
- Before sending an email, check whether it was already sent (audit table)
### Strategy 3: Deduplication via correlation ID
Use the `jobKey` or business correlation ID as dedup key:
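\`\`\`python
# `dedup_store` is an illustrative async key-value store, not a specific library.
async def handle_job(job):
    # Duplicate delivery: return the result stored by the first execution.
    if await dedup_store.exists(job.key):
        return await dedup_store.get(job.key)
    result = await do_work(job)
    await dedup_store.store(job.key, result)
    return result
\`\`\`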
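Strategy 2 from the guide above (check before mutating) could look like the following minimal sketch. `OrderStore` is a hypothetical in-memory stand-in for an orders table and `create_order_once` is an illustrative name; neither is part of the engine or SDK.

```python
import uuid


class OrderStore:
    """In-memory stand-in for an orders table keyed by correlation ID."""

    def __init__(self):
        self._by_correlation = {}

    def find_by_correlation_id(self, correlation_id):
        return self._by_correlation.get(correlation_id)

    def insert(self, correlation_id, payload):
        order = {"id": str(uuid.uuid4()), "correlation_id": correlation_id, **payload}
        self._by_correlation[correlation_id] = order
        return order


def create_order_once(store, correlation_id, payload):
    # Check before mutating: a redelivered job finds the existing order.
    existing = store.find_by_correlation_id(correlation_id)
    if existing is not None:
        return existing
    return store.insert(correlation_id, payload)


store = OrderStore()
first = create_order_once(store, "corr-42", {"sku": "A1"})
second = create_order_once(store, "corr-42", {"sku": "A1"})  # duplicate delivery
```

Against a real database the check and the insert are not atomic, so a unique constraint on the correlation ID is still needed to cover concurrent deliveries.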
Engine support for idempotency
The engine can help:
// Worker SDK helper
const worker = new JobWorker({
  client,
  jobType: 'send-email',
  // The SDK generates the dedup key automatically
  dedupStore: new RedisDedupStore({ url: 'redis://...' }),
  ttl: 24 * 3600, // 24h dedup window
  handler: async (job) => {
    // If this is a retry of an already-completed job, the SDK skips the handler
    await sendEmail(job.variables);
  }
});
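What such a helper might do internally can be sketched in Python as a wrapper around the handler. This is an assumption about the mechanism, not the SDK's actual implementation: `InMemoryDedupStore` and `with_dedup` are hypothetical names, and the in-memory store stands in for Redis.

```python
import asyncio
import time


class InMemoryDedupStore:
    """Hypothetical dedup store with per-key expiry (stand-in for Redis)."""

    def __init__(self, clock=time.monotonic):
        self._entries = {}
        self._clock = clock

    async def get(self, key):
        entry = self._entries.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._entries[key]  # the dedup window has passed
            return None
        return value

    async def put(self, key, value, ttl_seconds):
        self._entries[key] = (self._clock() + ttl_seconds, value)


def with_dedup(store, ttl_seconds):
    """Wrap a handler so a job key seen within the TTL is not re-executed."""

    def decorator(handler):
        async def wrapped(job):
            cached = await store.get(job["key"])
            if cached is not None:
                return cached["result"]
            result = await handler(job)
            # Wrap the result so a handler that returns None is still cached.
            await store.put(job["key"], {"result": result}, ttl_seconds)
            return result

        return wrapped

    return decorator


sent = []


@with_dedup(InMemoryDedupStore(), ttl_seconds=24 * 3600)
async def send_email(job):
    sent.append(job["variables"]["to"])
    return "sent"


async def main():
    job = {"key": "job-1", "variables": {"to": "a@example.com"}}
    first = await send_email(job)
    second = await send_email(job)  # redelivery of the same job
    return first, second


results = asyncio.run(main())
```

The second call returns the cached result without invoking the handler, so only one email is recorded.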
Why duplicates occur
Real scenarios (each one leads to a duplicate):
- Worker crashes mid-processing: the engine never received the ACK, so it resends the job. The worker had already executed part of it.
- Network timeout on the ACK: the worker completed the job, the ACK was lost on the network, and the engine resends.
- Job timeout expired: the worker took longer than the job timeout, so the engine assigns the job to another worker while the first is still processing.
- Engine failover: the leader changed, and the new leader does not know which jobs the old one had already distributed.
These scenarios cannot be eliminated without sacrificing availability.
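The lost-ACK scenario can be reproduced with a toy model. The sketch below is a simplification under stated assumptions: `Engine` and `IdempotentWorker` are hypothetical classes, the "network" drops exactly the first ACK, and redelivery happens immediately instead of after a timeout.

```python
class Engine:
    """Toy engine: a job stays pending until an ACK actually arrives."""

    def __init__(self, drop_first_ack=True):
        self.completed = set()
        self.deliveries = 0
        self._drop_next_ack = drop_first_ack

    def ack(self, job_key):
        if self._drop_next_ack:
            self._drop_next_ack = False  # simulate the ACK lost on the network
            return
        self.completed.add(job_key)

    def run(self, job_key, worker, max_deliveries=5):
        # Redeliver until the job is marked complete.
        while job_key not in self.completed and self.deliveries < max_deliveries:
            self.deliveries += 1
            worker(job_key)
            self.ack(job_key)


class IdempotentWorker:
    def __init__(self):
        self.seen = set()
        self.side_effects = 0

    def __call__(self, job_key):
        if job_key in self.seen:
            return  # duplicate delivery: skip the side effect
        self.seen.add(job_key)
        self.side_effects += 1


engine = Engine(drop_first_ack=True)
worker = IdempotentWorker()
engine.run("job-7", worker)
```

The engine delivers the job twice, yet the worker performs its side effect once, which is the behavior the worker contract asks for.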
Validation: Intuit benchmark
From analysis/intuit-production-benchmarks:
"100 duplicate events from 25 million processes — root cause under investigation"
100 / 25,000,000 = 0.0004% duplicate rate under extreme conditions (15h tsunami test).
This reinforces that:
- Duplicates are rare, but they do occur
- Idempotent workers handle them gracefully
- There is no data loss (which would be worse)
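For reference, the rate quoted above can be checked with a few lines of arithmetic; the two input figures come from the benchmark quote, everything else is derived.

```python
duplicates = 100
processes = 25_000_000

rate = duplicates / processes      # fraction of processes that produced a duplicate
rate_percent = rate * 100          # the same value expressed as a percentage
one_in = processes // duplicates   # "one duplicate per N processes"

print(f"{rate_percent:.4f}% (1 in {one_in:,})")  # prints "0.0004% (1 in 250,000)"
```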
Failure modes mitigation
| Failure | Engine behavior | Worker requirement |
|---|---|---|
| Worker crashes | Job reactivates after timeout | Idempotent: re-execute OK |
| Network glitch on ACK | Job reactivates | Idempotent: re-execute OK |
| Timeout expires | Job reactivates (on another worker) | Idempotent: safe |
| Engine failover | Pending jobs redistributed | Idempotent: safe |
| DB transaction fails | Job NOT marked complete, will retry | Idempotent: safe |
Special cases
Workers that are hard to make idempotent
Send email
Problem: a duplicated email spams the user. Solution: check a dedup store before sending.
async def send_email(job):
    dedup_key = f"email:{job.key}"
    if await redis.exists(dedup_key):
        return  # already sent
    await provider.send(...)
    await redis.setex(dedup_key, 86400, "1")  # remember for 24h
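A check followed by a later set leaves a window in which two concurrent deliveries both pass the check. One common variant claims the key atomically before sending. The sketch below assumes a Redis client whose `set` accepts `nx` and `ex` (as redis-py's does); `FakeRedis` is an in-memory stand-in so the example runs on its own, and `send_email_once` and `fake_send` are hypothetical names.

```python
import asyncio


class FakeRedis:
    """In-memory stand-in mimicking SET with NX (expiry is accepted but not enforced)."""

    def __init__(self):
        self._data = {}

    async def set(self, key, value, nx=False, ex=None):
        if nx and key in self._data:
            return None  # key already claimed
        self._data[key] = value
        return True

    async def delete(self, key):
        self._data.pop(key, None)


async def send_email_once(redis, job_key, provider_send):
    dedup_key = f"email:{job_key}"
    # Atomic claim: only one delivery of this job wins the key.
    claimed = await redis.set(dedup_key, "1", nx=True, ex=86400)
    if not claimed:
        return False  # another delivery already sent it, or is sending it
    try:
        await provider_send()
    except Exception:
        await redis.delete(dedup_key)  # release the claim so a retry can send
        raise
    return True


outbox = []


async def fake_send():
    outbox.append("welcome email")


async def main():
    redis = FakeRedis()
    # Two concurrent deliveries of the same job.
    return await asyncio.gather(
        send_email_once(redis, "job-9", fake_send),
        send_email_once(redis, "job-9", fake_send),
    )


outcomes = asyncio.run(main())
```

The trade-off is that claiming first leans toward at-most-once for this one side effect: if the worker dies between the claim and the send, the email is not sent until the key expires.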
Charge credit card
Problem: a duplicated charge is catastrophic. Solution: use the payment provider's idempotency key.
async def charge(job):
    # Stripe and similar providers accept an idempotency key.
    # Stripe guarantees that the same idempotency_key yields the same response.
    # `stripe` here is shorthand for a payment client, not a literal SDK call.
    return await stripe.charge(
        amount=job.amount,
        currency='usd',
        idempotency_key=f"workflow-{job.key}"
    )
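The provider-side behavior this relies on can be illustrated with a toy model. `FakePaymentProvider` and `charge_for_job` are hypothetical and only mimic the idea of replaying a stored response for a repeated key; they do not reproduce any real provider's API.

```python
import itertools


class FakePaymentProvider:
    """Toy provider that replays the stored response for a repeated idempotency key."""

    def __init__(self):
        self._responses = {}
        self._ids = itertools.count(1)
        self.charges_made = 0

    def charge(self, amount, currency, idempotency_key):
        if idempotency_key in self._responses:
            return self._responses[idempotency_key]  # replay, no new charge
        self.charges_made += 1
        response = {
            "charge_id": f"ch_{next(self._ids)}",
            "amount": amount,
            "currency": currency,
        }
        self._responses[idempotency_key] = response
        return response


def charge_for_job(provider, job_key, amount):
    # Derive the key from the job so every redelivery reuses it.
    return provider.charge(amount, "usd", idempotency_key=f"workflow-{job_key}")


provider = FakePaymentProvider()
first = charge_for_job(provider, "job-3", 1999)
retry = charge_for_job(provider, "job-3", 1999)  # duplicate delivery
other = charge_for_job(provider, "job-4", 1999)  # a different job
```

The key must be derived from something stable across redeliveries, such as the job key; a value generated fresh on each attempt would defeat the mechanism.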
Database INSERT
Problem: a duplicate insert can either fail (unique constraint) or create a duplicate row. Solution: use INSERT ... ON CONFLICT DO NOTHING, or use a UUID derived from the correlation ID as the key.
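Both suggestions can be combined, as in this minimal sketch using Python's built-in `sqlite3` (the `ON CONFLICT ... DO NOTHING` clause needs SQLite 3.24 or newer). The table layout, `ORDER_NAMESPACE`, `order_id_for`, and `insert_order` are assumptions made for illustration.

```python
import sqlite3
import uuid

# Hypothetical namespace for deriving stable IDs from correlation IDs.
ORDER_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "orders.example.com")


def order_id_for(correlation_id):
    # uuid5 is deterministic: the same correlation ID always maps to the same UUID.
    return str(uuid.uuid5(ORDER_NAMESPACE, correlation_id))


def insert_order(conn, correlation_id, sku):
    with conn:  # commit on success, roll back on error
        cur = conn.execute(
            "INSERT INTO orders (id, sku) VALUES (?, ?) ON CONFLICT (id) DO NOTHING",
            (order_id_for(correlation_id), sku),
        )
    return cur.rowcount == 1  # True only when a row was actually inserted


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id TEXT PRIMARY KEY, sku TEXT NOT NULL)")

inserted_first = insert_order(conn, "corr-42", "A1")
inserted_again = insert_order(conn, "corr-42", "A1")  # duplicate delivery
row_count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
```

Because the primary key is derived from the correlation ID, the second insert hits the conflict clause and is silently skipped instead of raising or creating a second row.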
When at-most-once is acceptable
Rare cases where duplicates would be worse than data loss:
- Notifications such as "you might be interested in X" (not critical)
- Analytics events (sampling is acceptable)
For these, the worker can deliberately NOT be idempotent. That is the developer's decision.
Links
- concepts/job-worker-pattern: worker pattern in detail
- analysis/error-handling-patterns: idempotency strategies
- analysis/intuit-production-benchmarks: real-world duplicate rate
- adrs/adr-016-minimal-outbound-worker-sdk: SDK that makes this easier