ADR-007: At-least-once delivery, idempotent workers

  • Status: Accepted
  • Date: 2026-05-14
  • Tags: core, semantics, workers

Context and Problem Statement

When the engine sends a job to a worker, there are three delivery-semantics models: at-most-once (jobs can be lost), at-least-once (jobs can be duplicated), and exactly-once (precise but costly). Which one do we use?

Decision Drivers

  • Exactly-once is prohibitively complex (requires distributed transactions; see the Two Generals' Problem)
  • At-most-once ruins business workflows (data loss is unacceptable)
  • Camunda uses at-least-once and has validated it in production (Intuit: 100 duplicates out of 25M = 0.0004%)
  • Idempotency is the worker's responsibility (a well-understood pattern)

Considered Options

  1. At-least-once + idempotent workers (Camunda 8 pattern)
  2. At-most-once (the worker never receives duplicates, but jobs can be lost)
  3. Exactly-once (absolute precision)
  4. Exactly-once with outbox pattern (transactional outbox)

Decision Outcome

Chosen option: At-least-once + idempotent workers, because:

  • It is the pragmatic standard (Kafka, Pulsar, and AWS SQS standard queues all provide at-least-once delivery)
  • Exactly-once delivery cannot be guaranteed across systems (Two Generals' Problem)
  • Worker idempotency is a well-documented pattern
  • It is validated by Intuit's production deployment
  • The engineering burden sits in the workers, not in the engine (better encapsulation)

Positive Consequences

  • Simpler engine (no exactly-once complexity)
  • Clean failure handling (always retry)
  • Compatible with all message brokers/queues
  • Standard pattern: worker code is portable
  • No distributed transactions

Negative Consequences

  • Workers MUST be idempotent (the developer's responsibility)
  • Duplicate processing is possible in edge cases
  • Documentation/onboarding burden (developers must be educated)
  • Some workers are inherently hard to make idempotent (e.g., send email)

Pros and Cons of the Options

At-least-once + idempotent workers

Pros:

  • Pragmatic standard
  • Simple engine
  • Compatible with the whole ecosystem
  • Clean failure recovery

Cons:

  • Responsibility falls on the workers
  • Duplicates are possible (rare, but possible)
  • Educational burden

At-most-once

Pros:

  • No duplicates
  • Simple engine as well

Cons:

  • Data loss is unacceptable for business-critical workflows
  • Failures = lost work
  • Worse for reliability

Exactly-once

Pros:

  • "Perfect" semantics

Cons:

  • Theoretically impossible across systems
  • Requires distributed transactions
  • Locks and coordination overhead
  • In practice, "exactly-once" claims are at-least-once + idempotency under the hood
  • Poor performance

Exactly-once with outbox pattern

Pros:

  • Closer to exactly-once within a single service
  • Well-known pattern

Cons:

  • Only works inside one service boundary
  • Cross-service delivery is still at-least-once
  • Adds complexity (outbox table, polling)
  • Net effect is similar to at-least-once + idempotency

Implementation: contract with workers

Documentation (developer onboarding)

# Worker Development Guide

## CRITICAL: Idempotency requirement

Your worker WILL receive duplicate jobs occasionally (network glitches,
worker crashes mid-completion, timeouts, etc.). This is NOT a bug.

Your worker MUST be idempotent. Three strategies:

### Strategy 1: Natural idempotency
Operations that produce the same result regardless of execution count:
- `confirmCustomer(id)` — confirming already-confirmed customer is no-op
- `setOrderStatus(id, "shipped")` — setting status that's already set

### Strategy 2: Business idempotency
Check before mutate using business identifiers:
- Before creating order, check if order with this `correlationId` exists
- Before sending email, check if email was already sent (audit table)

### Strategy 3: Deduplication via correlation ID
Use the `jobKey` or business correlation ID as dedup key:
\`\`\`python
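async def handle_job(job):
    # dedup_store is an illustrative interface: get() returns None on a miss.
    cached = await dedup_store.get(job.key)
    if cached is not None:
        return cached  # duplicate delivery: reuse the stored result
    result = await do_work(job)
    await dedup_store.store(job.key, result)
    return result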
\`\`\`
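
Strategy 2 from the guide above (check before mutate) can be sketched as follows. This is a minimal illustration under stated assumptions, not the engine's API: `OrderStore`, `create_order_once`, and the in-memory dict are hypothetical stand-ins for a real orders table keyed by `correlationId`.

```python
import asyncio


class OrderStore:
    """Hypothetical in-memory stand-in for an orders table keyed by correlation ID."""

    def __init__(self):
        self._orders = {}

    async def find_by_correlation_id(self, correlation_id):
        return self._orders.get(correlation_id)

    async def insert(self, correlation_id, payload):
        order = {"correlation_id": correlation_id, **payload}
        self._orders[correlation_id] = order
        return order

    def count(self):
        return len(self._orders)


async def create_order_once(store, correlation_id, payload):
    """Check before mutate: return the existing order if this job already ran."""
    existing = await store.find_by_correlation_id(correlation_id)
    if existing is not None:
        return existing
    return await store.insert(correlation_id, payload)


async def demo():
    store = OrderStore()
    first = await create_order_once(store, "corr-42", {"sku": "A1"})
    # Simulated duplicate delivery of the same job.
    second = await create_order_once(store, "corr-42", {"sku": "A1"})
    return store.count(), first == second


order_count, same_order = asyncio.run(demo())
```

Against a real database the check and the insert are two separate steps, so a unique constraint on the correlation ID is still advisable to cover two workers racing on the same job.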

Engine support for idempotency

The engine can help:

// Worker SDK helper
const worker = new JobWorker({
  client,
  jobType: 'send-email',
  // The SDK generates the dedup key automatically
  dedupStore: new RedisDedupStore({ url: 'redis://...' }),
  ttl: 24 * 3600,  // 24h dedup window
  handler: async (job) => {
    // If this is a retry of an already-completed job, the SDK skips the handler
    await sendEmail(job.variables);
  }
});
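
What such a helper might do internally can be sketched in Python. This is an assumption about the behavior, not the SDK's implementation: `InMemoryDedupStore` and `with_dedup` are hypothetical names, and an in-memory dict replaces Redis.

```python
import asyncio
import time


class InMemoryDedupStore:
    """Hypothetical dedup store; maps job key -> expiry timestamp."""

    def __init__(self, clock=time.monotonic):
        self._expiry = {}
        self._clock = clock

    def seen(self, key):
        expires_at = self._expiry.get(key)
        return expires_at is not None and expires_at > self._clock()

    def mark(self, key, ttl_seconds):
        self._expiry[key] = self._clock() + ttl_seconds


def with_dedup(store, ttl_seconds):
    """Wrap a handler so a job key completed within the TTL window is skipped."""
    def decorator(handler):
        async def wrapped(job):
            if store.seen(job["key"]):
                return "skipped"
            await handler(job)
            # Marked only after success, so a failed attempt is retried normally.
            store.mark(job["key"], ttl_seconds)
            return "handled"
        return wrapped
    return decorator


sent = []
store = InMemoryDedupStore()


@with_dedup(store, ttl_seconds=24 * 3600)
async def send_email(job):
    sent.append(job["variables"]["to"])


async def demo():
    job = {"key": "job-1", "variables": {"to": "a@example.com"}}
    return [await send_email(job), await send_email(job)]


outcomes = asyncio.run(demo())
```

Marking the key after the handler succeeds keeps the at-least-once guarantee: a crash mid-handler leaves no mark, so the retry runs the handler again.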

Why duplicates occur

Real scenarios (all of them lead to a duplicate):

  1. Worker crashes mid-processing: the engine never received the ACK, so it re-sends the job. The worker had already executed part of it.
  2. Network timeout on the ACK: the worker completed the job, the ACK was lost on the network, and the engine re-sends.
  3. Timeout expired: the worker took longer than the job timeout, so the engine assigns the job to another worker while the first is still processing.
  4. Engine failover: the leader changed, and the new leader does not know which jobs the old leader had dispatched.

These scenarios cannot be eliminated without sacrificing availability.
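
Scenario 2 (lost ACK) can be reproduced with a toy model. The `Engine` and `Worker` classes below are hypothetical and deterministic: the first ACK is dropped on purpose to force a redelivery, and the worker tracks completed job keys so the side effect runs once.

```python
class Worker:
    """Idempotent worker: remembers completed job keys and skips repeats."""

    def __init__(self):
        self.completed = set()
        self.side_effects = 0
        self.deliveries = 0

    def handle(self, job_key):
        self.deliveries += 1
        if job_key not in self.completed:
            self.side_effects += 1  # the real work happens only once
            self.completed.add(job_key)
        return "ACK"


class Engine:
    """Toy engine: redelivers a job until it observes an ACK."""

    def __init__(self, drop_first_n_acks):
        self.acks_to_drop = drop_first_n_acks

    def deliver(self, worker, job_key, max_attempts=5):
        for attempt in range(1, max_attempts + 1):
            ack = worker.handle(job_key)
            if self.acks_to_drop > 0:
                self.acks_to_drop -= 1  # simulate the ACK being lost in transit
                continue
            if ack == "ACK":
                return attempt
        raise RuntimeError("job never acknowledged")


worker = Worker()
engine = Engine(drop_first_n_acks=1)
attempts = engine.deliver(worker, "job-7")
print(attempts, worker.deliveries, worker.side_effects)  # prints "2 2 1"
```

The job is delivered twice, but the side effect happens once, which is the behavior the worker contract asks for.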

Validation: Intuit benchmark

From analysis/intuit-production-benchmarks:

"100 duplicate events from 25 million processes — root cause under investigation"

100 / 25,000,000 = 0.0004% duplicate rate under extreme conditions (15h tsunami test).

This reinforces:

  • Duplicates are rare, but they do occur
  • Idempotent workers handle them gracefully
  • No data loss (which would be worse)
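
The arithmetic behind the 0.0004% figure, and what it would mean per million jobs, is shown below. Extrapolating the rate to other workloads is an assumption, since the quoted root cause was still under investigation.

```python
duplicates = 100
processes = 25_000_000

rate = duplicates / processes
print(f"{rate:.4%}")               # duplicate rate as a percentage: 0.0004%
print(round(rate * 1_000_000, 6))  # expected duplicates per million jobs: 4.0
```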

Failure mode mitigation

| Failure | Engine behavior | Worker requirement |
|---|---|---|
| Worker crashes | Job reactivates after timeout | Idempotent: re-execute OK |
| Network glitch on ACK | Job reactivates | Idempotent: re-execute OK |
| Timeout expires | Job reactivates (other worker) | Idempotent: safe |
| Engine failover | Pending jobs redistributed | Idempotent: safe |
| DB transaction fails | Job NOT marked complete, will retry | Idempotent: safe |

Special cases

Workers that are hard to make idempotent

Send email

Problem: a duplicate email spams the user. Solution: check a dedup store before sending.

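async def send_email(job):
    dedup_key = f"email:{job.key}"
    # Note: exists + setex is not atomic, so two concurrent deliveries
    # of the same job can both pass this check.
    if await redis.exists(dedup_key):
        return  # already sent
    await provider.send(...)
    await redis.setex(dedup_key, 86400, "1")  # mark as sent for 24h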

Charge credit card

Problem: a duplicate charge is catastrophic. Solution: use the payment provider's idempotency key.

async def charge(job):
    # `stripe.charge` stands in for the payment provider's client call;
    # Stripe and similar providers accept an idempotency key.
    # Same idempotency_key = same response, so a retry does not charge twice.
    return await stripe.charge(
        amount=job.amount,
        currency='usd',
        idempotency_key=f"workflow-{job.key}"
    )

Database INSERT

Problem: a duplicate insert can fail (unique constraint) or create a duplicate row. Solution: INSERT ... ON CONFLICT DO NOTHING, or use a UUID derived from the correlation ID.
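
Both ideas can be combined, as in this minimal sketch using Python's built-in `sqlite3` (the `ON CONFLICT ... DO NOTHING` clause needs SQLite 3.24 or newer). The `orders` table, the `NAMESPACE` value, and the helper names are assumptions made for illustration.

```python
import sqlite3
import uuid

# Hypothetical namespace for deriving deterministic IDs from correlation IDs.
NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "workflow.example.com")


def order_id_for(correlation_id):
    """Same correlation ID always yields the same UUID (uuid5 is deterministic)."""
    return str(uuid.uuid5(NAMESPACE, correlation_id))


def insert_order(conn, correlation_id, sku):
    """Insert once; a duplicate delivery hits the primary key and is ignored."""
    cursor = conn.execute(
        "INSERT INTO orders (id, sku) VALUES (?, ?) ON CONFLICT(id) DO NOTHING",
        (order_id_for(correlation_id), sku),
    )
    conn.commit()
    return cursor.rowcount  # 1 if a row was inserted, 0 if it already existed


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id TEXT PRIMARY KEY, sku TEXT NOT NULL)")

first = insert_order(conn, "corr-42", "A1")
second = insert_order(conn, "corr-42", "A1")  # simulated duplicate delivery
total = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
```

Because the ID is derived from the correlation ID rather than generated randomly, the database itself rejects the duplicate and the worker needs no separate dedup store.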

When at-most-once is acceptable

Rare cases where duplicates would be worse than data loss:

  • Notifications such as "you might be interested in X" (not critical)
  • Analytics events (sampling is OK)

For these, the worker can deliberately NOT be idempotent. That is the developer's decision.