Compensation and BPMN error handling

Exception and rollback semantics in BPMN. Errors (boundary events) are the primary mechanism; compensation is the coordinated "undo" for sagas. Implementation: M2.

Mental model: 3 distinct mechanisms

BPMN has three ways to handle what would be a try/catch/finally in code:

BPMN errors: a semantic business error (insufficient funds, customer not found). It leaves through a boundary error event and interrupts the activity.
Incidents: an unexpected technical error (database down, worker bug). Requires human or operational intervention.
Compensation: coordinated rollback of previous steps. Triggered explicitly via a compensation throw event.

Important: BPMN has no generic "try/catch". Every error has a declared code, and the model routes it.
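On the worker side, the split between the first two mechanisms comes down to how a failure is reported. The following is a minimal sketch, assuming a hypothetical `BusinessError` type and `Classify` helper (neither name comes from the engine): business errors carry a declared code, and anything else is treated as a technical failure.

```go
package main

import "errors"

// BusinessError is a hypothetical worker-side error carrying the declared
// BPMN error code (for example "insufficient-funds").
type BusinessError struct {
	Code    string
	Message string
}

func (e *BusinessError) Error() string { return e.Code + ": " + e.Message }

// ErrDatabaseDown stands in for any unexpected technical failure.
var ErrDatabaseDown = errors.New("database unreachable")

// Outcome is what the worker reports back to the engine.
type Outcome int

const (
	CompleteJob    Outcome = iota // success: the flow continues normally
	ThrowBPMNError                // business error: routed by a boundary error event
	FailJob                       // technical error: uses a retry, incident when none are left
)

// Classify maps a Go error to an outcome and, for business errors, the code.
func Classify(err error) (Outcome, string) {
	var be *BusinessError
	switch {
	case err == nil:
		return CompleteJob, ""
	case errors.As(err, &be):
		return ThrowBPMNError, be.Code
	default:
		return FailJob, ""
	}
}
```

Compensation is left out on purpose: it is triggered by the model through a throw event, not by the worker's return value.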

BPMN Error Boundary Event

Diagram (mental)

flowchart LR
    A[Service Task:<br/>charge-payment] --> B[next step]
    A -.->|boundary error<br/>code='insufficient-funds'| C[User Task:<br/>ask-different-payment]

Execution semantics

The worker executes the charge-payment job.
The worker calls failJobWithBPMNError("insufficient-funds", "balance < amount").
The engine receives the command and looks for a boundary error event with code="insufficient-funds" in the scope of the activity.
If it finds a match:
- It interrupts the activity (clean-up).
- It activates the flow leaving the boundary event.
If it does NOT find a match:
- It raises the error to the parent scope (subprocess → process).
- If the error reaches the root without a match → incident (uncaught BPMN error).
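The resolution steps above can be sketched as a walk up the scope chain. This is an in-memory illustration only; the `Scope` type and `ResolveError` function are hypothetical names, and the real engine resolves handlers from persisted state.

```go
package main

// Scope is a hypothetical in-memory stand-in for an element instance:
// an activity, a subprocess, or the process root.
type Scope struct {
	ID       string
	Parent   *Scope            // nil for the process root
	Handlers map[string]string // error code -> boundary event element ID
}

// ResolveError walks from the failing activity outward and returns the first
// boundary event that catches the code. ok is false when the walk passes the
// root without a match, which is the case where an incident would be raised.
func ResolveError(from *Scope, code string) (boundaryID string, caughtIn *Scope, ok bool) {
	for s := from; s != nil; s = s.Parent {
		if id, found := s.Handlers[code]; found {
			return id, s, true
		}
	}
	return "", nil, false
}
```

Because the walk starts at the failing activity, the innermost handler always wins over one declared on an enclosing scope.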

Implementation (Postgres state)

-- Subscription table per scope (fast resolution)
CREATE TABLE error_event_subscriptions (
    process_instance_key BIGINT,
    element_instance_key BIGINT,  -- the activity that carries the boundary event
    scope_key BIGINT,             -- scope where the handler applies
    error_code TEXT,
    boundary_element_id TEXT,
    PRIMARY KEY (process_instance_key, element_instance_key, error_code)
);

-- Lookup on activation: walk from the failing element ($3) up through its ancestors
WITH RECURSIVE parents AS (
    SELECT key, parent_key, 0 AS depth FROM element_instances WHERE key = $3
    UNION ALL
    SELECT ei.key, ei.parent_key, p.depth + 1
    FROM element_instances ei
    JOIN parents p ON ei.key = p.parent_key
)
SELECT s.boundary_element_id
FROM error_event_subscriptions s
JOIN parents p ON p.key = s.scope_key
WHERE s.process_instance_key = $1
  AND s.error_code = $2
ORDER BY p.depth ASC  -- innermost first
LIMIT 1;

Bubbling rules

Error with no handler in the current scope → bubble up.
Error with no handler in the root process → incident.
Error caught by an event subprocess (error start event) → the handler is activated and the original flow terminates.
Error in a multi-instance activity → each instance is evaluated independently.

Variables on error

When the boundary event fires:
- If the worker passed variables with the failure, they are merged into the scope.
- If the boundary event has an outputMapping, it is applied.
- The error code and message are available as variables: errorCode, errorMessage.
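A minimal sketch of that merge follows. The function name `ApplyErrorVariables` is hypothetical, and the output mapping is simplified to plain source-to-target variable names, which is an assumption; real mappings are usually expressions. The order in which the mapping and the errorCode/errorMessage variables are applied is also an assumption here.

```go
package main

// ApplyErrorVariables builds the variables visible after a boundary error
// event fires. outputMapping is simplified to source -> target variable
// names; that shape is an assumption made for this sketch.
func ApplyErrorVariables(scope, workerVars map[string]any, outputMapping map[string]string, code, message string) map[string]any {
	merged := make(map[string]any, len(scope)+len(workerVars)+2)
	for k, v := range scope {
		merged[k] = v
	}
	for k, v := range workerVars { // worker variables win over scope values
		merged[k] = v
	}
	merged["errorCode"] = code
	merged["errorMessage"] = message

	// Read every mapping source before writing any target, so the result
	// does not depend on map iteration order.
	mapped := make(map[string]any, len(outputMapping))
	for source, target := range outputMapping {
		if v, ok := merged[source]; ok {
			mapped[target] = v
		}
	}
	for k, v := range mapped {
		merged[k] = v
	}
	return merged
}
```

The function returns a new map and leaves the scope map untouched, which keeps the pre-error state available for debugging.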

Incidents

When an incident is created

Uncaught BPMN error: an error with no boundary handler in any scope.
Job retries exhausted: retries: 0 on the job (typically after N failed attempts).
Expression evaluation error: a FEEL/CEL expression failed and there is no fallback.
Variable not found: reference to a nonexistent variable.
External system unreachable (connector): after internal retries.

Lifecycle

stateDiagram-v2
    [*] --> CREATED
    CREATED --> RESOLVED: resolve<br/>(--update-retries=N)
    RESOLVED --> CompletedOrNewIncident: resume
    CREATED --> CANCELLED: cancel-instance
    CompletedOrNewIncident: completed or new incident

Schema:

CREATE TABLE incidents (
    key BIGINT PRIMARY KEY,
    process_instance_key BIGINT NOT NULL,
    element_instance_key BIGINT,
    job_key BIGINT,
    error_type TEXT NOT NULL,  -- JOB_NO_RETRIES, EXTRACT_VALUE_ERROR, etc.
    error_message TEXT NOT NULL,
    state TEXT NOT NULL,  -- CREATED, RESOLVED
    tenant_id TEXT NOT NULL,
    created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);

CREATE INDEX ON incidents (state, tenant_id);
CREATE INDEX ON incidents (process_instance_key);
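The lifecycle diagram above can be read as a small transition table over the `state` column. The sketch below is an assumption about how such a guard might look; `IncidentState`, `Transition`, and the action strings are illustrative names taken from the diagram labels, not an engine API.

```go
package main

import "fmt"

// IncidentState holds the values stored in incidents.state.
type IncidentState string

const (
	Created   IncidentState = "CREATED"
	Resolved  IncidentState = "RESOLVED"
	Cancelled IncidentState = "CANCELLED"
)

// allowed lists the state changes drawn in the lifecycle diagram.
// "resume" is omitted: its result (completion or a new incident) belongs
// to the process instance, not to this incident row.
var allowed = map[IncidentState]map[string]IncidentState{
	Created: {
		"resolve":         Resolved,
		"cancel-instance": Cancelled,
	},
}

// Transition returns the next state, or an error when the action is not
// valid for the current state.
func Transition(from IncidentState, action string) (IncidentState, error) {
	if next, ok := allowed[from][action]; ok {
		return next, nil
	}
	return from, fmt.Errorf("incident in state %s does not accept %q", from, action)
}
```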

Typical resolution (Operate UI / `wf` CLI)

# Diagnose
wf incident get 22518...
wf instance timeline <instance-key>

# Fix the cause
wf variable set <instance-key> amount=100.00  # correct the variable

# Retry
wf incident resolve 22518... --update-retries 3

Compensation

Use case: distributed saga

Consider a trip-booking process:

flowchart LR
    A[Book flight] --> B[Book hotel] --> C[Book car] --> D[Confirm trip]

If "Book car" fails and we want to cancel the flight and the hotel, that is compensation.

flowchart TD
    BF[Book flight] -.->|comp| CF[Cancel flight]
    BH[Book hotel] -.->|comp| CH[Cancel hotel]
    BC["Book car (failed)"]:::failed
    classDef failed fill:#f99,stroke:#900

Model elements

Compensation boundary event attached to an activity: defines what to do to "undo" that activity.
Compensation activity (service task with isForCompensation=true): the code that performs the compensation.
Compensation throw event: the trigger that fires the chain of compensations.

<bpmn:serviceTask id="bookFlight" name="Book flight">
  <bpmn:extensionElements>
    <zeebe:taskDefinition type="book-flight"/>
  </bpmn:extensionElements>
</bpmn:serviceTask>

<bpmn:boundaryEvent id="compBoundary1" attachedToRef="bookFlight">
  <bpmn:compensateEventDefinition/>
</bpmn:boundaryEvent>

<bpmn:serviceTask id="cancelFlight" isForCompensation="true">
  <bpmn:extensionElements>
    <zeebe:taskDefinition type="cancel-flight"/>
  </bpmn:extensionElements>
</bpmn:serviceTask>

<bpmn:association associationDirection="One"
                  sourceRef="compBoundary1"
                  targetRef="cancelFlight"/>

<bpmn:intermediateThrowEvent id="triggerComp">
  <bpmn:compensateEventDefinition/>  <!-- no activityRef = compensate all -->
</bpmn:intermediateThrowEvent>

Execution semantics

When an activity completes successfully, it registers a compensation subscription with its boundary handler.

CREATE TABLE compensation_subscriptions (
    process_instance_key BIGINT,
    element_instance_key BIGINT,
    completed_at TIMESTAMPTZ NOT NULL,
    scope_key BIGINT NOT NULL,
    handler_element_id TEXT NOT NULL,  -- the "cancelFlight"
    handler_variables JSONB,  -- snapshot of variables at completion time
    PRIMARY KEY (process_instance_key, element_instance_key)
);

When a compensation throw fires:
- If it has an activityRef: it compensates only that activity.
- If no activityRef is set: it compensates the whole current scope in reverse completion order.

-- Lookup: what to compensate
SELECT * FROM compensation_subscriptions
WHERE process_instance_key = $1
  AND scope_key = $2
ORDER BY completed_at DESC;  -- LIFO

Execution of each compensation handler:
- It is a normal service task (worker pull).
- It receives the variables from the snapshot taken at completion time.
- If the compensation fails: that path becomes an incident (the following handlers do not run unless it is resolved).
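The selection logic above can be sketched in memory as a filter plus a descending sort. `Subscription` loosely mirrors a row of compensation_subscriptions; the `ElementID` field and the integer timestamp are assumptions added to keep the example short, and `PlanCompensation` is a hypothetical name.

```go
package main

import "sort"

// Subscription loosely mirrors a row of compensation_subscriptions.
type Subscription struct {
	ElementID   string // BPMN id of the completed activity (assumed field)
	ScopeKey    int64
	CompletedAt int64 // completion time as Unix nanoseconds (assumption)
	HandlerID   string
}

// PlanCompensation returns the handlers to run for a throw event raised in
// scopeKey. An empty activityRef selects every subscription in the scope.
// The result is ordered most recently completed first (LIFO).
func PlanCompensation(subs []Subscription, scopeKey int64, activityRef string) []Subscription {
	var plan []Subscription
	for _, s := range subs {
		if s.ScopeKey != scopeKey {
			continue // belongs to another scope
		}
		if activityRef != "" && s.ElementID != activityRef {
			continue // the throw targets a single activity
		}
		plan = append(plan, s)
	}
	sort.SliceStable(plan, func(i, j int) bool {
		return plan[i].CompletedAt > plan[j].CompletedAt
	})
	return plan
}
```

Running the handlers in the returned order, and stopping at the first failure, matches the rule that later handlers wait until the incident is resolved.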

Key rules (frequent source of bugs)

Only completed activities are candidates for compensation. An activity that is still active when the throw fires is NOT compensated.
Variable snapshot: the compensation sees the variables as they were at completion, not the current ones.
Not re-entrant: an activity is compensated only once per throw event.
Subprocess compensation: a throw inside a subprocess only compensates elements within that subprocess. For process-wide compensation, throw in the main flow.
No built-in timeout: compensation activities have no timer boundary by default. If they hang → incident.
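Three of these rules (completed-only, snapshot, once per throw) can be illustrated with a small registry. This is a sketch under simplifying assumptions: the `Registry` type is hypothetical, it models a single throw event, and the snapshot is a shallow copy, which is only enough for flat variable maps.

```go
package main

// Registry tracks compensation subscriptions for one scope and one throw.
type Registry struct {
	snapshots   map[string]map[string]any // activity ID -> variables at completion
	compensated map[string]bool
}

func NewRegistry() *Registry {
	return &Registry{
		snapshots:   map[string]map[string]any{},
		compensated: map[string]bool{},
	}
}

// Complete registers the activity and stores a copy of its variables, so
// later changes to vars are not visible to the compensation handler.
func (r *Registry) Complete(activityID string, vars map[string]any) {
	snap := make(map[string]any, len(vars))
	for k, v := range vars {
		snap[k] = v
	}
	r.snapshots[activityID] = snap
}

// Compensate returns the snapshot for the handler. It reports false when
// the activity never completed or was already compensated.
func (r *Registry) Compensate(activityID string) (map[string]any, bool) {
	snap, ok := r.snapshots[activityID]
	if !ok || r.compensated[activityID] {
		return nil, false
	}
	r.compensated[activityID] = true
	return snap, true
}
```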

Combination: error + compensation

Typical saga pattern:

flowchart TD
    BF[Book flight] --> BH[Book hotel] --> BC[Book car] --> CO[Confirm]
    BF -.->|comp| CF[Cancel flight]
    BH -.->|comp| CH[Cancel hotel]
    BC -.->|comp| CC[Cancel car]
    BC -.->|error code='payment-failed'| TC["Throw compensate<br/>(no activityRef)"]
    TC --> EF[End: failed]

Implementation:

Book car fails with the BPMN error payment-failed.
The engine finds the boundary error event in the root subprocess, which leads to Throw compensate.
Throw compensate activates the compensation handlers in LIFO order:
- Cancel car is skipped: Book car failed, so it never completed and is NOT compensated.
- Cancel hotel.
- Cancel flight.
When all handlers finish → the flow reaches End: failed.
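The walkthrough above can be simulated outside the engine with a few lines of Go. This is only a sketch of the ordering rule, not how the engine executes jobs; `Step`, `BPMNError`, and `RunSaga` are hypothetical names.

```go
package main

import "errors"

// Step pairs an activity with its compensation handler.
type Step struct {
	Name       string
	Do         func() error
	Compensate func()
}

// BPMNError models a declared business error such as "payment-failed".
type BPMNError struct{ Code string }

func (e *BPMNError) Error() string { return "bpmn error: " + e.Code }

// RunSaga executes the steps in order. When a step fails with a BPMNError,
// the steps that already completed are compensated in reverse order; the
// failing step itself is skipped because it never completed. Other errors
// are returned untouched, since they would become incidents instead.
func RunSaga(steps []Step) error {
	var completed []Step
	for _, s := range steps {
		if err := s.Do(); err != nil {
			var be *BPMNError
			if errors.As(err, &be) {
				for i := len(completed) - 1; i >= 0; i-- {
					if completed[i].Compensate != nil {
						completed[i].Compensate()
					}
				}
			}
			return err
		}
		completed = append(completed, s)
	}
	return nil
}
```

With Book flight and Book hotel succeeding and Book car returning `&BPMNError{Code: "payment-failed"}`, the handlers run as Cancel hotel, then Cancel flight.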

Useful patterns

Retry-then-error

For technical failures, retry with backoff before falling through to a boundary error:

flowchart LR
    A["Service task<br/>taskDefinition retries=3, backoff=expo"] -->|retries=0| B["boundary error 'max-retries'"]
    B --> C[notify ops]
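The same idea can be sketched from the worker's point of view: attempt the work with exponential backoff, and report the `max-retries` code from the diagram once the attempts are used up. `RetryPolicy`, its fields, and `ErrUnavailable` are hypothetical names; in the diagram the retry count lives on the task definition rather than in worker code.

```go
package main

import (
	"errors"
	"time"
)

// ErrUnavailable stands in for a transient technical failure.
var ErrUnavailable = errors.New("downstream unavailable")

// RetryPolicy mirrors the diagram label: retries=3, backoff=expo.
type RetryPolicy struct {
	Retries int
	Base    time.Duration
	Sleep   func(time.Duration) // defaults to time.Sleep when nil
}

// Run calls work until it succeeds or the retries are used up. It returns
// "" on success and the BPMN error code "max-retries" otherwise.
func (p RetryPolicy) Run(work func() error) string {
	sleep := p.Sleep
	if sleep == nil {
		sleep = time.Sleep
	}
	delay := p.Base
	for attempt := 1; attempt <= p.Retries; attempt++ {
		if err := work(); err == nil {
			return ""
		}
		if attempt < p.Retries {
			sleep(delay) // wait before the next attempt
			delay *= 2   // exponential backoff
		}
	}
	return "max-retries"
}
```

The `Sleep` field exists so tests can replace the real delay; production code would leave it nil.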

Saga compensation with confirmation

Some systems require a "confirmation" step after compensation:

flowchart TD
    BF[Book flight] --> BH[Book hotel] --> BC[Book car] --> P[Pay] --> SC[Send confirmation]
    P -.->|error='payment-failed'| TC[Throw compensate]
    TC --> LIFO[Comps run in LIFO order]
    LIFO --> SFN[Send failure notification]
    SFN --> E[End]

Compensation with timeout

If a compensation can hang, add a timer boundary to the compensation handler:

flowchart LR
    A[Cancel flight] -.->|timer 5min| B[Manual review]
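In code, the timer boundary behaves like a deadline around the handler. The sketch below is an analogy, not engine code: `RunCompensation` is a hypothetical name, the returned strings are illustrative route labels, and the handler receives a bare cancellation channel to keep the example small.

```go
package main

import (
	"context"
	"time"
)

// RunCompensation runs a compensation handler under a deadline and returns
// the route the flow would take: "completed", "incident" when the handler
// fails, or "manual-review" when the deadline fires first.
func RunCompensation(limit time.Duration, handler func(cancelled <-chan struct{}) error) string {
	ctx, cancel := context.WithTimeout(context.Background(), limit)
	defer cancel()

	done := make(chan error, 1) // buffered so a late handler never blocks
	go func() { done <- handler(ctx.Done()) }()

	select {
	case err := <-done:
		if err != nil {
			return "incident"
		}
		return "completed"
	case <-ctx.Done():
		return "manual-review"
	}
}
```

A call such as `RunCompensation(5*time.Minute, cancelFlight)` corresponds to the 5 minute timer in the diagram, with `cancelFlight` being whatever function performs the cancellation.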

Semantic testing

Mandatory tests to guarantee a correct implementation:

func TestCompensationOrderLIFO(t *testing.T) {
    // bookFlight, bookHotel, bookCar todos completan
    // throw compensate
    // assert: cancelCar runs first, then cancelHotel, then cancelFlight
}

func TestUncompletedActivityNotCompensated(t *testing.T) {
    // bookFlight completes, bookHotel is active
    // throw compensate
    // assert: cancelFlight runs, cancelHotel does NOT
}

func TestErrorBubblingToParent(t *testing.T) {
    // subprocess raises error, no handler in subprocess
    // assert: error bubbles to parent, handler there fires
}

func TestVariablesSnapshotPreserved(t *testing.T) {
    // bookFlight completes with flightId=X
    // process variables change to flightId=Y
    // compensate → cancelFlight receives flightId=X
}

Operational metrics

wf_engine_bpmn_errors_thrown_total{process_id, error_code}
wf_engine_bpmn_errors_caught_total{process_id, error_code, handler_scope}
wf_engine_bpmn_errors_uncaught_total{process_id}  -- alert!
wf_engine_compensations_triggered_total{process_id}
wf_engine_compensations_executed_total{process_id, handler_element}
wf_engine_compensations_failed_total{process_id, handler_element}
wf_engine_incidents_created_total{error_type}
wf_engine_incidents_resolved_total{error_type, resolution_time_bucket}

Key alert: rate(wf_engine_bpmn_errors_uncaught_total[5m]) > 0.01 → bug in the model (error without a handler).

Tricky edge cases

Compensation in multi-instance: each child is compensated separately, in LIFO order.
Error in a compensation handler: does NOT trigger another compensation. It becomes an incident.
Compensation in an event subprocess: it runs in the context of the event subprocess, not the main flow.
Error in a parallel branch: only interrupts that branch. To terminate everything, use a Terminate End event.
BPMN error vs incident: a BPMN error is declared (a boundary handler exists); an incident is uncaught.

References

BPMN 2.0 spec, chapter 10 Process
Camunda 8 compensation docs
bpmn execution model: general model
incident management: UI/UX for incidents
retry backoff: retry semantics
bpmn coverage matrix: compensation phase
error handling patterns: end-to-end patterns
