Process modification at runtime¶
Disruptive operations on in-flight process instances: canceling elements, activating arbitrary elements, setting variables, and altering the flow. They are meant for incident resolution and complex migrations, and they come with care requirements, added complexity, and rules.
The problem¶
Sometimes processes end up in invalid states:
- A bug in the BPMN model sent the process down the wrong branch.
- A badly set variable blocks progress.
- An external change (for example, an order canceled outside the system) requires skipping elements.
- The process definition was updated and active instances need to move to the new version.
Restarting the process from scratch means losing progress and repeating side effects (charges, emails). Modification lets you intervene without restarting.
Modification vs Migration vs Cancel¶
| Operation | What it does | When |
|---|---|---|
| Cancel | Terminates the instance | "This instance does not continue" |
| Modification | Moves a token to another element, sets variables | "Bug, jump to recovery" |
| Migration | Changes the process definition version | "New version deployed, migrate in-flight instances" |
Modification is the most invasive of the three and requires the highest privilege (admin role).
Capabilities¶
1. Cancel element instance¶
Cancels an active token on a specific element.
Use case: a parallel branch is stuck on a boundary timer that never fires. Cancel only that branch and let the other branch keep running.
2. Activate element¶
Creates a token on an element, bypassing the normal flow logic.
Use case: the process should have gone through "send-notification" but a bug skipped it. An operator manually activates "send-notification":
# Activate with specific variables
wf instance modify <instance-key> \
--activate-element send-notification \
--variables '{"reason": "manual catch-up"}'
3. Move token¶
A cancel plus an activate, applied atomically:
wf instance modify <instance-key> \
--move-token-from <source-element> \
--move-token-to <target-element>
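One way to read the move-token operation is as a single command that carries two operations. The sketch below borrows the shape of the ModifyInstanceCommand type from the Implementation section; the moveToken helper and the reduced field set are assumptions made for illustration, not the engine's actual code.

```go
package main

import "fmt"

// Simplified, hypothetical versions of the command types; only the
// fields needed for this illustration are included.
type OpType string

const (
	Cancel   OpType = "cancel"
	Activate OpType = "activate"
)

type ModifyOp struct {
	Type               OpType
	ElementInstanceKey int64  // used by Cancel
	ElementID          string // used by Activate
}

type ModifyInstanceCommand struct {
	InstanceKey int64
	Operations  []ModifyOp
	Reason      string
}

// moveToken expresses "move" as cancel(source) + activate(target) inside
// one command, so both operations can be applied or rejected together.
func moveToken(instanceKey, sourceInstanceKey int64, targetElementID, reason string) ModifyInstanceCommand {
	return ModifyInstanceCommand{
		InstanceKey: instanceKey,
		Operations: []ModifyOp{
			{Type: Cancel, ElementInstanceKey: sourceInstanceKey},
			{Type: Activate, ElementID: targetElementID},
		},
		Reason: reason,
	}
}

func main() {
	cmd := moveToken(100, 101, "error-recovery-task", "example move")
	fmt.Println(len(cmd.Operations), cmd.Operations[0].Type, cmd.Operations[1].Type)
}
```

Keeping both operations in one command is what lets the engine treat the move as a unit instead of two independent requests.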
4. Set / unset variables¶
# Set a variable in process scope
wf variable set <instance-key> --name customerId --value "cust-42"
# Set in element scope
wf variable set <instance-key> --name temp --value "..." --scope <element-instance>
# Unset
wf variable unset <instance-key> --name temp
5. Resolve incident¶
Retries the job that caused the incident.
If the cause was a badly set variable, combine it with a variable set, so the retry sees the corrected value.
Safety rules¶
What is NOT allowed¶
- Bypassing security barriers: you cannot activate an element that the flow could not conceptually reach without passing through auth gateways.
- Modifying completed elements: you cannot "un-complete" a task that is already done.
- Corrupting the command log: every modification creates traceable commands.
- Skipping compensation handlers: if you activate a compensation throw, all handlers run (no selective skipping).
What requires extra confirmation¶
- Cancel element: when it has pending compensation.
- Manually activating event handlers: a workaround for an engine bug.
- Move-token across subprocess boundaries: the scope shift is complex.
Semantic rules¶
Target element inside an inactive subprocess: error¶
Process root
└── Subprocess "approval" (currently NOT active)
    └── Task "manager-review"
# Attempt: --activate-element manager-review
# Error: "Cannot activate element in inactive scope. First activate subprocess."
Source element has no active token: error¶
# Attempt: --cancel-element T1 (T1 already completed)
# Error: "Element T1 has no active instance to cancel."
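The two rejections above can be pictured as validation that runs before any operation is applied. This is a minimal sketch under stated assumptions: the ElementInstance and Model types, the error values, and both validate functions are hypothetical names, and the model is reduced to a map from element ID to enclosing subprocess ID.

```go
package main

import (
	"errors"
	"fmt"
)

// ElementInstance is a hypothetical, minimal view of a token in the engine.
type ElementInstance struct {
	Key       int64
	ElementID string
	Active    bool
}

// Model maps each element ID to the ID of its enclosing subprocess.
// An empty or missing entry is treated as "directly under the process root".
type Model map[string]string

var (
	ErrInactiveScope = errors.New("cannot activate element in inactive scope")
	ErrNoActiveToken = errors.New("element has no active instance to cancel")
)

// validateActivate rejects an activation whose enclosing subprocess
// has no active instance.
func validateActivate(model Model, instances []ElementInstance, elementID string) error {
	parent := model[elementID]
	if parent == "" {
		return nil // root scope of a running instance counts as active
	}
	for _, ei := range instances {
		if ei.ElementID == parent && ei.Active {
			return nil
		}
	}
	return fmt.Errorf("%w: activate %q first", ErrInactiveScope, parent)
}

// validateCancel rejects a cancel that targets a completed or unknown
// element instance.
func validateCancel(instances []ElementInstance, key int64) error {
	for _, ei := range instances {
		if ei.Key == key && ei.Active {
			return nil
		}
	}
	return fmt.Errorf("%w: key %d", ErrNoActiveToken, key)
}

func main() {
	model := Model{"manager-review": "approval"}
	instances := []ElementInstance{{Key: 1, ElementID: "T1", Active: false}}
	fmt.Println(validateActivate(model, instances, "manager-review"))
	fmt.Println(validateCancel(instances, 1))
}
```

Running validation for every operation before applying the first one keeps a rejected command from leaving partial changes behind.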
Multi-instance: be explicit about the target key¶
# The MI activity "process-orders" has 5 active child instances
# Cancel element <mi-key> → cancels the whole MI
# Cancel element <child-key> → cancels only that one child
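The difference between the two keys comes down to whether the cancel cascades. The sketch below assumes a parent link on every element instance; the Instance type, newMultiInstance, and cancelCascade are hypothetical names used only to illustrate the behavior.

```go
package main

import "fmt"

// Instance is a hypothetical element instance with a link to its parent scope.
type Instance struct {
	Key       int64
	ParentKey int64 // 0 for a top-level instance
	Active    bool
}

// newMultiInstance builds one active MI body plus the given number of
// active children, keyed parentKey+1 .. parentKey+children.
func newMultiInstance(parentKey int64, children int) map[int64]*Instance {
	m := map[int64]*Instance{
		parentKey: &Instance{Key: parentKey, Active: true},
	}
	for i := 1; i <= children; i++ {
		k := parentKey + int64(i)
		m[k] = &Instance{Key: k, ParentKey: parentKey, Active: true}
	}
	return m
}

// cancelCascade deactivates the instance with the given key and every
// active descendant. It returns how many instances were canceled.
func cancelCascade(instances map[int64]*Instance, key int64) int {
	target, ok := instances[key]
	if !ok || !target.Active {
		return 0
	}
	target.Active = false
	canceled := 1
	for _, child := range instances {
		if child.ParentKey == key && child.Active {
			canceled += cancelCascade(instances, child.Key)
		}
	}
	return canceled
}

func main() {
	m := newMultiInstance(10, 5)
	fmt.Println(cancelCascade(m, 12)) // 1: only that child
	fmt.Println(cancelCascade(m, 10)) // 5: the MI body plus the 4 remaining children
}
```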
Implementation¶
Command type¶
type ModifyInstanceCommand struct {
	InstanceKey int64
	Operations  []ModifyOp
	Actor       string
	Reason      string // mandatory for audit
}
type ModifyOp struct {
	Type               ModifyOpType // Cancel, Activate, SetVariable, ...
	ElementInstanceKey int64
	ElementID          string
	Variables          map[string]any
	Scope              *int64
}
Atomicity¶
All operations in a ModifyInstanceCommand are applied atomically (same command). If one fails, all of them are rolled back.
BEGIN;
-- Apply ops sequentially
INSERT INTO commands (...); -- audit
UPDATE element_instances SET state='CANCELED' WHERE key = $cancel_key;
INSERT INTO element_instances (state, ...) VALUES ('ACTIVATING', ...); -- new token
UPDATE variables SET value = $new_value WHERE ...;
COMMIT;
If the engine crashes mid-transaction, the transaction rolls back. If it crashes after COMMIT, replaying the command produces the same state.
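The all-or-nothing behavior can also be illustrated without a database. The sketch below applies operations to a working copy and only returns it when every operation succeeds. The State, Op, and applyAtomically names are assumptions for this example, and the in-memory copy stands in for the transaction.

```go
package main

import (
	"errors"
	"fmt"
)

// State is a hypothetical in-memory stand-in for the engine state.
type State struct {
	ActiveTokens map[int64]string // element instance key -> element ID
	Variables    map[string]any
}

// Op mutates a working copy of the state and may fail.
type Op func(s *State) error

// clone copies both maps so a failed command leaves the original untouched.
// Values are copied shallowly, which is enough for this illustration.
func (s State) clone() State {
	c := State{ActiveTokens: map[int64]string{}, Variables: map[string]any{}}
	for k, v := range s.ActiveTokens {
		c.ActiveTokens[k] = v
	}
	for k, v := range s.Variables {
		c.Variables[k] = v
	}
	return c
}

// applyAtomically returns the modified copy only if every op succeeds;
// on the first failure it returns the original state and the error.
func applyAtomically(s State, ops []Op) (State, error) {
	work := s.clone()
	for i, op := range ops {
		if err := op(&work); err != nil {
			return s, fmt.Errorf("operation %d failed, nothing applied: %w", i, err)
		}
	}
	return work, nil
}

func cancelToken(key int64) Op {
	return func(s *State) error {
		if _, ok := s.ActiveTokens[key]; !ok {
			return errors.New("no active instance to cancel")
		}
		delete(s.ActiveTokens, key)
		return nil
	}
}

func setVariable(name string, value any) Op {
	return func(s *State) error {
		s.Variables[name] = value
		return nil
	}
}

func main() {
	s := State{ActiveTokens: map[int64]string{1: "T1"}, Variables: map[string]any{}}
	_, err := applyAtomically(s, []Op{setVariable("x", 1), cancelToken(99)})
	fmt.Println(err != nil, len(s.Variables)) // true 0
}
```

The second operation fails, so the variable written by the first one never becomes visible in the original state.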
Replay determinism¶
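Critical: the modify command must be replayable. Therefore:
- Operations are deterministic (no time.Now() inside processing).
- The result of each op can be derived from the state plus the op input.
// applyModifyOp is pure: it reads only its arguments and returns the next state.
func applyModifyOp(state State, op ModifyOp) State {
	switch op.Type {
	case Cancel:
		return state.cancelElement(op.ElementInstanceKey)
	case Activate:
		return state.activateElement(op.ElementID, op.Variables, op.Scope)
	case SetVariable:
		// Variables to set travel in op.Variables (name -> value).
		return state.setVariables(op.ElementInstanceKey, op.Variables)
	default:
		// Unknown op types leave the state untouched; validation should reject them earlier.
		return state
	}
}
Replaying with the same state and the same op yields the same resulting state.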
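A small, self-contained illustration of the replay property: the timestamp is captured when the command is written and carried in the command, so processing never reads the clock. The Command and State types and the apply and replay functions are hypothetical and far simpler than a real engine.

```go
package main

import (
	"fmt"
	"reflect"
)

// Command is a hypothetical log entry. Timestamp is recorded at write
// time and replayed as-is, so apply never needs the current time.
type Command struct {
	Name      string
	Value     string
	Timestamp int64
}

type State struct {
	Variables  map[string]string
	ModifiedAt int64
}

// apply is pure: the result depends only on the previous state and cmd.
func apply(s State, cmd Command) State {
	next := State{
		Variables:  make(map[string]string, len(s.Variables)+1),
		ModifiedAt: cmd.Timestamp,
	}
	for k, v := range s.Variables {
		next.Variables[k] = v
	}
	next.Variables[cmd.Name] = cmd.Value
	return next
}

// replay folds the whole log over an empty initial state.
func replay(log []Command) State {
	s := State{Variables: map[string]string{}}
	for _, c := range log {
		s = apply(s, c)
	}
	return s
}

func main() {
	log := []Command{
		{Name: "customerId", Value: "cust-42", Timestamp: 1000},
		{Name: "temp", Value: "x", Timestamp: 1001},
	}
	fmt.Println(reflect.DeepEqual(replay(log), replay(log))) // true
}
```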
API¶
REST¶
POST /api/v1/instances/{key}/modify
{
  "reason": "Customer requested manual approval skip",
  "operations": [
    { "type": "cancel", "elementInstanceKey": 2251799813685234 },
    { "type": "activate", "elementId": "send-confirmation",
      "variables": { "skippedApproval": true } }
  ]
}
HTTP 200
{
  "modificationKey": 2251799813685900,
  "appliedOperations": 2,
  "newElementInstances": [
    { "key": 2251799813685901, "elementId": "send-confirmation" }
  ]
}
CLI examples¶
# Single operation
wf instance modify 22518... --activate-element send-email --reason "manual catch-up after bug fix"
# Multiple operations atomically
wf instance modify 22518... \
--cancel-element 22519... \
--activate-element error-recovery-task \
--variable errorReason="manual intervention" \
--reason "Resolving incident #INC-12345"
# From a YAML file
wf instance modify 22518... -f modification.yaml
# modification.yaml
reason: "Resolving incident INC-12345"
operations:
  - type: cancel
    elementInstanceKey: 22519...
  - type: activate
    elementId: error-recovery-task
    variables:
      errorReason: "manual intervention"
Modification vs Migration¶
Both change a running instance, but:
| Aspect | Modification | Migration |
|---|---|---|
| Process definition | Same | Different version |
| Element mapping | Manual ops | Mapping rules |
| Use case | Bug recovery, ad-hoc fix | Version upgrade |
| Frequency | Rare | Possibly batch |
| Risk | High (manual) | Medium (rules-based) |
Migration spec: see concepts/process-instance-migration.
Audit trail¶
Every modification is recorded in the audit log:
INSERT INTO audit_log (action, actor, resource, payload, ...)
VALUES ('instance.modify', 'paulo@example.com', 'instance/22518...',
'{
"reason": "...",
"operations": [...],
"before_state_hash": "...",
"after_state_hash": "..."
}', ...);
The state hashes make it possible to verify after the fact that the modification produced the expected state.
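For such a hash to be comparable across runs it has to be computed over a canonical encoding. The sketch below hashes variables in sorted key order with SHA-256. The stateHash function and the choice to hash only variables are assumptions; a real state hash would presumably also cover element instances.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
)

// stateHash hashes variables in sorted key order, so equal states always
// produce equal hashes regardless of Go's random map iteration order.
func stateHash(vars map[string]string) string {
	keys := make([]string, 0, len(vars))
	for k := range vars {
		keys = append(keys, k)
	}
	sort.Strings(keys)

	h := sha256.New()
	for _, k := range keys {
		// Length-prefix each field so ("ab","c") and ("a","bc") hash differently.
		fmt.Fprintf(h, "%d:%s=%d:%s;", len(k), k, len(vars[k]), vars[k])
	}
	return hex.EncodeToString(h.Sum(nil))
}

func main() {
	before := map[string]string{"totalAmount": "null"}
	after := map[string]string{"totalAmount": "0"}
	fmt.Println("before:", stateHash(before))
	fmt.Println("after: ", stateHash(after))
	fmt.Println("changed:", stateHash(before) != stateHash(after))
}
```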
UI workflow (Operate)¶
Instance Detail Page
└── [Modify Instance] button (admin only)
    └── Modal:
        ├── Reason field (required)
        ├── Operation list:
        │   ├── [+ Cancel element] → pick from active elements
        │   ├── [+ Activate element] → pick from process def
        │   ├── [+ Set variable] → name + JSON value
        │   └── [+ Resolve incident]
        ├── Preview pane: shows resulting state
        └── [Confirm Modify] button
The preview shows:
- Elements that will be canceled.
- New tokens that will be activated.
- Variables that will change.
- Compensation handlers that will fire.
Risks¶
Compensation cascade¶
Canceling an element that has a boundary compensation handler causes the handlers to fire.
Cancel Task "book-flight" (subscribed to compensation)
→ Triggers compensation handler "cancel-flight"
→ Compensation handler runs (calls Stripe to refund, etc.)
This is semantically correct, but the operator must be aware of it.
UI: highlight when a cancel will trigger compensation.
Variable scope confusion¶
Setting a variable without a scope writes to the root scope. If the operator meant a local scope, the result is a surprise.
UI: scope selector with an explanation.
Element activation in inactive scope¶
Covered by the semantic rules above. The engine must reject the operation before applying anything.
Race with a worker¶
An operator activates element X. Meanwhile, a worker was processing job Y, which has already canceled X. That is a race.
Mitigation: modifications go through the same command queue. The engine processes commands sequentially per partition, so there is no race.
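The mitigation can be pictured as a single consumer draining one queue per partition. In this sketch only the processing goroutine touches the state, so worker and operator commands are serialized by construction. The Partition and Command types are hypothetical, and a buffered channel stands in for the durable command log.

```go
package main

import "fmt"

// Command is a hypothetical unit of work: a worker's job completion or
// an operator's modification. Both go through the same queue.
type Command struct {
	Source string
	Apply  func(active map[string]bool) error
}

// Partition processes commands one at a time from a single queue.
type Partition struct {
	queue  chan Command
	active map[string]bool // touched only by the processing goroutine
	errs   []error
	done   chan struct{}
}

func NewPartition() *Partition {
	p := &Partition{
		queue:  make(chan Command, 16),
		active: map[string]bool{},
		done:   make(chan struct{}),
	}
	go func() {
		defer close(p.done)
		for cmd := range p.queue {
			if err := cmd.Apply(p.active); err != nil {
				p.errs = append(p.errs, err)
			}
		}
	}()
	return p
}

// Submit enqueues a command; the queue order is the processing order.
func (p *Partition) Submit(c Command) { p.queue <- c }

// Close stops accepting commands and waits until the queue is drained.
func (p *Partition) Close() {
	close(p.queue)
	<-p.done
}

func main() {
	p := NewPartition()
	p.Submit(Command{Source: "worker", Apply: func(a map[string]bool) error {
		delete(a, "X") // completing job Y cancels X
		return nil
	}})
	p.Submit(Command{Source: "operator", Apply: func(a map[string]bool) error {
		a["X"] = true // the modification activates X afterwards
		return nil
	}})
	p.Close()
	fmt.Println(p.active["X"]) // true
}
```

Whichever command reaches the queue first is applied first, and the second one is evaluated against the state the first one left behind.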
Permissions¶
Modification is a high-privilege operation:
| Role | Permission |
|---|---|
| Worker | None |
| Operator | View modifications (audit), resolve incidents |
| Admin | Full modify |
| Custom: "Incident-Responder" | Resolve incident + minor variable updates |
Audit: every modification is logged with its actor.
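The role table can be encoded as a simple grant map. The sketch below is one possible reading of it: the Role and Action names are hypothetical, "minor variable updates" is modeled as a single SetVariable action, and Admin is assumed to include every other action.

```go
package main

import "fmt"

type Role string
type Action string

const (
	Worker            Role = "worker"
	Operator          Role = "operator"
	Admin             Role = "admin"
	IncidentResponder Role = "incident-responder"
)

const (
	ViewModifications Action = "view-modifications"
	ResolveIncident   Action = "resolve-incident"
	SetVariable       Action = "set-variable"
	FullModify        Action = "full-modify"
)

// grants mirrors the permission table; anything not listed is denied.
var grants = map[Role]map[Action]bool{
	Worker:            {},
	Operator:          {ViewModifications: true, ResolveIncident: true},
	Admin:             {ViewModifications: true, ResolveIncident: true, SetVariable: true, FullModify: true},
	IncidentResponder: {ResolveIncident: true, SetVariable: true},
}

// allowed reports whether the role may perform the action.
// Unknown roles and unknown actions are denied by default.
func allowed(r Role, a Action) bool {
	return grants[r][a]
}

func main() {
	fmt.Println(allowed(Operator, FullModify))       // false
	fmt.Println(allowed(Admin, FullModify))          // true
	fmt.Println(allowed(IncidentResponder, SetVariable)) // true
}
```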
Bulk modification¶
Sometimes N instances need the same modification:
# Find affected instances
wf instance list --in-incident --error-type EXTRACT_VALUE_ERROR \
--process order-flow -o keys > affected.txt
# Apply same fix to all
while read -r key; do
  wf instance modify "$key" \
    --variable totalAmount=0 \
    --resolve-incident \
    --reason "Bulk fix for variable schema bug, INC-12345"
done < affected.txt
For large volumes (>1000), use the bulk API:
POST /api/v1/instances/bulk-modify
{
  "filter": {
    "processDefinitionId": "order-flow",
    "state": "INCIDENT",
    "errorType": "EXTRACT_VALUE_ERROR"
  },
  "modification": {
    "operations": [...]
  },
  "reason": "...",
  "maxInstances": 10000,
  "dryRun": false
}
With dryRun=true, the call returns the count and a sample of the affected instances without applying anything.
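A rough sketch of how the dry-run path could behave on the server side: filter first, report the count and a sample, and only touch instances when dryRun is false. All type and function names here are hypothetical. Treating a match count above maxInstances as an error, and the sample size of 3, are assumptions rather than documented behavior.

```go
package main

import (
	"errors"
	"fmt"
)

type Instance struct {
	Key                 int64
	ProcessDefinitionID string
	State               string
	ErrorType           string
}

type Filter struct {
	ProcessDefinitionID string
	State               string
	ErrorType           string
}

type BulkResult struct {
	Matched int     // instances selected by the filter
	Sample  []int64 // first few matching keys, for review
	Applied int     // instances actually modified (0 on dry run)
}

const sampleSize = 3 // assumed sample size

func (f Filter) matches(i Instance) bool {
	return i.ProcessDefinitionID == f.ProcessDefinitionID &&
		i.State == f.State && i.ErrorType == f.ErrorType
}

// bulkModify selects instances, then either reports (dryRun) or applies.
func bulkModify(all []Instance, f Filter, maxInstances int, dryRun bool,
	modify func(Instance) error) (BulkResult, error) {
	var matched []Instance
	for _, inst := range all {
		if f.matches(inst) {
			matched = append(matched, inst)
		}
	}
	res := BulkResult{Matched: len(matched)}
	for i := 0; i < len(matched) && i < sampleSize; i++ {
		res.Sample = append(res.Sample, matched[i].Key)
	}
	if len(matched) > maxInstances {
		return res, errors.New("filter matches more instances than maxInstances")
	}
	if dryRun {
		return res, nil // nothing is applied
	}
	for _, inst := range matched {
		if err := modify(inst); err != nil {
			return res, fmt.Errorf("instance %d: %w", inst.Key, err)
		}
		res.Applied++
	}
	return res, nil
}

func main() {
	all := []Instance{
		{Key: 1, ProcessDefinitionID: "order-flow", State: "INCIDENT", ErrorType: "EXTRACT_VALUE_ERROR"},
		{Key: 2, ProcessDefinitionID: "order-flow", State: "ACTIVE"},
	}
	f := Filter{ProcessDefinitionID: "order-flow", State: "INCIDENT", ErrorType: "EXTRACT_VALUE_ERROR"}
	res, err := bulkModify(all, f, 10000, true, func(Instance) error { return nil })
	fmt.Println(res.Matched, res.Applied, err)
}
```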
Best practices¶
✅ Document reason¶
The reason is mandatory. It serves the audit trail and retrospectives.
✅ Test in staging first¶
Modifying a production instance without knowing the outcome is dangerous. Reproduce it in staging first.
✅ Snapshot before bulk¶
# Back up the state
wf snapshot create --tag pre-modify-INC-12345
# Apply the modification
wf instance bulk-modify ...
# If something goes wrong, restore selectively
✅ Coordinate with downstream¶
If the modification triggers workers (e.g., activating "send-email"), warn the email team so they expect the extra volume.
❌ Modification as a substitute for a bug fix¶
If the same bug hits 100×/day, fix the BPMN model. Modification is for emergencies.
❌ Modification without a reason¶
The audit log becomes incomprehensible 6 months later.
Observability¶
wf_engine_modifications_total{operation_type, actor}
wf_engine_modification_operations_per_request_histogram
wf_engine_modifications_bulk_total{filter_type}
Alerts:
- rate(wf_engine_modifications_total[1h]) > 10 → something is going on, investigate.
- Daily report of modifications to the security team.
Limitations (M1)¶
Some operations are not implemented in M1 and are deferred:
- ❌ Activating an element inside a specific multi-instance child (only the parent MI).
- ❌ Modifying event subprocesses (complex lifecycle).
- ❌ Modifying inside a compensation flow.
- ❌ Bulk modify of more than 10000 instances (must be chunked manually).
Complete coverage is planned for M2-M3.
Roadmap¶
- M1: Cancel, activate, set variable. Single-instance API + CLI.
- M2: Bulk modify API. Operate UI integration.
- M3: Migration (separate from modification) production-ready.
- M4: Cross-tenant modify (admin-only, audit-heavy).
References¶
- concepts/process-instance-migration — version migration
- concepts/incident-management — incidents flow
- adrs/adr-019-replay-determinism-invariant — replay invariant
- adrs/adr-025-audit-logging-mandatory — audit
- analysis/security-threat-model — privilege escalation analysis
- Camunda Modification API