Saltar a contenido

Blueprint Plataforma Simplificada

Especificación técnica para construir un workflow engine simplificado que actúe como drop-in replacement de Camunda 8 para casos de uso comunes. Diseñado para ser implementado con Claude Code.

Principios de Diseño

  1. Single-node first: sin clustering ni Raft hasta que se necesite. PostgreSQL como state store.
  2. Command sourcing: mantener el patrón probado de Camunda — commands al log, procesamiento determinístico.
  3. REST-first API: compatible con la API de Camunda /v2/ para facilitar migración.
  4. BPMN subset: soportar los elementos que cubren el 95% de los casos de uso.
  5. Composition over inheritance: processors con behaviors inyectados.
  6. SQL-queryable state: sin Elasticsearch — PostgreSQL para todo (state + search).

Arquitectura General

flowchart LR
    subgraph Engine["Workflow Engine"]
        API[REST API OpenAPI] --> Processor[Command Processor single-threaded]
        Processor --> State[(State Store Postgres/SQLite)]
        Processor --> JobPoller[Job Poller REST]
        Processor --> EventLog[(Event Log append-only)]
        TimerSched[Timer Scheduler] --> Processor
        BpmnParser[BPMN Parser & Transformer] --> Processor
    end

Módulo 1: Command Log (Event Store)

Lección de Camunda

Camunda usa un LogStream append-only replicado via Raft, con ApplicationEntry batches y FlowControl backpressure.

Implementación Simplificada

Table: command_log

CREATE TABLE command_log (
    position     BIGSERIAL PRIMARY KEY,
    record_type  TEXT NOT NULL,  -- 'COMMAND', 'EVENT', 'REJECTION'
    value_type   TEXT NOT NULL,  -- 'PROCESS_INSTANCE', 'JOB', 'MESSAGE', etc.
    intent       TEXT NOT NULL,  -- 'ACTIVATE_ELEMENT', 'COMPLETE', etc.
    key          BIGINT NOT NULL,
    timestamp    TIMESTAMPTZ NOT NULL DEFAULT now(),
    value        JSONB NOT NULL,
    source_position BIGINT,
    partition_id INT NOT NULL DEFAULT 1
);
CREATE INDEX idx_log_unprocessed ON command_log(position) WHERE record_type = 'COMMAND';

Procesamiento: un goroutine/thread lee commands no procesados secuencialmente, genera events, los inserta en la misma tabla dentro de una transacción.

Por qué PostgreSQL: WAL nativo provee durabilidad equivalente al Raft log para single-node. LISTEN/NOTIFY puede reemplazar LogRecordAwaiter. Backup via pg_dump o streaming replication para HA.

Diferencias clave con Camunda

  • Sin SBE — JSON (JSONB en PostgreSQL)
  • Sin partitioning — single partition en MVP
  • Sin Raft — PostgreSQL streaming replication para HA
  • Backpressure: rate limiting simple en el API layer

Módulo 2: State Store

Lección de Camunda

142+ column families en RocksDB con type-safe access via ColumnFamily<K, V>.

Implementación Simplificada

Tables principales (PostgreSQL):

-- Process definitions
CREATE TABLE process_definitions (
    key BIGINT PRIMARY KEY,
    bpmn_process_id TEXT NOT NULL,
    version INT NOT NULL,
    name TEXT,
    bpmn_xml TEXT NOT NULL,
    checksum TEXT NOT NULL,
    tenant_id TEXT NOT NULL DEFAULT '<default>',
    deployed_at TIMESTAMPTZ NOT NULL,
    UNIQUE(bpmn_process_id, version, tenant_id)
);

-- Process instances
CREATE TABLE process_instances (
    key BIGINT PRIMARY KEY,
    process_definition_key BIGINT NOT NULL REFERENCES process_definitions(key),
    bpmn_process_id TEXT NOT NULL,
    version INT NOT NULL,
    state TEXT NOT NULL DEFAULT 'ACTIVE',  -- ACTIVE, COMPLETED, CANCELED
    start_date TIMESTAMPTZ NOT NULL,
    end_date TIMESTAMPTZ,
    parent_instance_key BIGINT,
    tenant_id TEXT NOT NULL DEFAULT '<default>',
    business_id TEXT
);

-- Element instances (flow node instances)
CREATE TABLE element_instances (
    key BIGINT PRIMARY KEY,
    process_instance_key BIGINT NOT NULL,
    process_definition_key BIGINT NOT NULL,
    element_id TEXT NOT NULL,      -- BPMN element ID
    element_type TEXT NOT NULL,     -- SERVICE_TASK, EXCLUSIVE_GATEWAY, etc.
    state TEXT NOT NULL,            -- ACTIVATING, ACTIVATED, COMPLETING, COMPLETED, TERMINATING, TERMINATED
    flow_scope_key BIGINT,         -- parent element instance
    created_at TIMESTAMPTZ NOT NULL,
    completed_at TIMESTAMPTZ,
    tenant_id TEXT NOT NULL DEFAULT '<default>'
);
CREATE INDEX idx_ei_process ON element_instances(process_instance_key);
CREATE INDEX idx_ei_scope ON element_instances(flow_scope_key) WHERE state NOT IN ('COMPLETED', 'TERMINATED');

-- Variables
CREATE TABLE variables (
    key BIGINT PRIMARY KEY,
    name TEXT NOT NULL,
    value JSONB NOT NULL,
    process_instance_key BIGINT NOT NULL,
    scope_key BIGINT NOT NULL,  -- element instance key
    tenant_id TEXT NOT NULL DEFAULT '<default>',
    UNIQUE(name, scope_key)
);
CREATE INDEX idx_var_scope ON variables(scope_key);

-- Jobs
CREATE TABLE jobs (
    key BIGINT PRIMARY KEY,
    type TEXT NOT NULL,
    process_instance_key BIGINT NOT NULL,
    element_instance_key BIGINT NOT NULL,
    element_id TEXT NOT NULL,
    state TEXT NOT NULL DEFAULT 'CREATED',  -- CREATED, ACTIVATED, COMPLETED, FAILED, ERROR_THROWN, TIMED_OUT
    retries INT NOT NULL DEFAULT 3,
    worker TEXT,
    deadline TIMESTAMPTZ,
    backoff_until TIMESTAMPTZ,
    custom_headers JSONB,
    variables JSONB,
    error_message TEXT,
    error_code TEXT,
    tenant_id TEXT NOT NULL DEFAULT '<default>',
    created_at TIMESTAMPTZ NOT NULL
);
CREATE INDEX idx_jobs_activatable ON jobs(type, tenant_id) WHERE state = 'CREATED';
CREATE INDEX idx_jobs_deadline ON jobs(deadline) WHERE state = 'ACTIVATED';
CREATE INDEX idx_jobs_backoff ON jobs(backoff_until) WHERE state = 'FAILED' AND backoff_until IS NOT NULL;

-- Timers
CREATE TABLE timers (
    key BIGINT PRIMARY KEY,
    process_instance_key BIGINT,
    element_instance_key BIGINT,
    process_definition_key BIGINT,
    element_id TEXT NOT NULL,
    due_date TIMESTAMPTZ NOT NULL,
    repetitions INT DEFAULT 0,
    state TEXT NOT NULL DEFAULT 'ACTIVE',
    tenant_id TEXT NOT NULL DEFAULT '<default>'
);
CREATE INDEX idx_timers_due ON timers(due_date) WHERE state = 'ACTIVE';

-- Messages (for correlation)
CREATE TABLE messages (
    key BIGINT PRIMARY KEY,
    name TEXT NOT NULL,
    correlation_key TEXT NOT NULL,
    message_id TEXT,
    variables JSONB,
    time_to_live INTERVAL,
    deadline TIMESTAMPTZ,
    state TEXT NOT NULL DEFAULT 'PUBLISHED',
    tenant_id TEXT NOT NULL DEFAULT '<default>',
    UNIQUE(name, correlation_key, message_id) -- dedup
);
CREATE INDEX idx_msg_correlate ON messages(name, correlation_key, tenant_id) WHERE state = 'PUBLISHED';

-- Message subscriptions
CREATE TABLE message_subscriptions (
    key BIGINT PRIMARY KEY,
    message_name TEXT NOT NULL,
    correlation_key TEXT NOT NULL,
    process_instance_key BIGINT NOT NULL,
    element_instance_key BIGINT NOT NULL,
    element_id TEXT NOT NULL,
    state TEXT NOT NULL DEFAULT 'ACTIVE',
    tenant_id TEXT NOT NULL DEFAULT '<default>'
);
CREATE INDEX idx_msub_correlate ON message_subscriptions(message_name, correlation_key, tenant_id) WHERE state = 'ACTIVE';

-- Incidents
CREATE TABLE incidents (
    key BIGINT PRIMARY KEY,
    process_instance_key BIGINT NOT NULL,
    element_instance_key BIGINT NOT NULL,
    job_key BIGINT,
    error_type TEXT NOT NULL,  -- JOB_NO_RETRIES, UNHANDLED_ERROR_EVENT, EXTRACT_VALUE_ERROR
    error_message TEXT,
    state TEXT NOT NULL DEFAULT 'ACTIVE',
    created_at TIMESTAMPTZ NOT NULL,
    resolved_at TIMESTAMPTZ,
    tenant_id TEXT NOT NULL DEFAULT '<default>'
);

-- Key generator
CREATE SEQUENCE key_generator START 1;

Diferencias con Camunda

  • SQL queries directas en vez de exporter + ES
  • Sin column families — tablas SQL con índices
  • Sin DbCompositeKey — claves naturales o foreign keys
  • Sin TransactionContext wrapper — transacciones PostgreSQL nativas

Módulo 3: BPMN Parser & Transformer

Lección de Camunda

5-step visitor pipeline: instantiation → attributes → context-dependent → container-aware → multi-instance wrapping.

Implementación Simplificada

Parsing: usar una library XML existente para parsear BPMN 2.0 XML. Producir un modelo in-memory.

Supported Elements (MVP): - process (with startEvent, endEvent) - serviceTask → creates jobs - userTask → creates user tasks - exclusiveGateway → condition evaluation - parallelGateway → fork/join - subProcess (embedded) - callActivity - intermediateCatchEvent (timer, message) - boundaryEvent (timer, error) - endEvent (none, error, terminate) - sequenceFlow with conditions

Transformation (2 pasos en vez de 5): 1. Parse: XML → in-memory model con todos los elementos y sus atributos 2. Link: resolver references (outgoing flows, default flows, error refs), validar, producir ExecutableProcess

Expression Language: subset de FEEL o usar CEL (Common Expression Language): - Literals, comparisons, arithmetic - String interpolation - Context access (variable lookup) - Basic functions (contains, matches, now, date)


Módulo 4: Processing Engine

Lección de Camunda

BpmnStreamProcessorBpmnElementProcessors (EnumMap) → 24 processors con 12 behaviors.

Implementación Simplificada

Element Processors (12 en vez de 24):

interface ElementProcessor {
    onActivate(context: ProcessingContext): Result
    onComplete(context: ProcessingContext): Result
    onTerminate(context: ProcessingContext): Result
}

interface ContainerProcessor extends ElementProcessor {
    onChildCompleted(context, childContext): Result
    onChildTerminated(context, childContext): Result
}
Element Processor Notas
PROCESS ProcessProcessor Container root
SERVICE_TASK ServiceTaskProcessor Creates job on activate
USER_TASK UserTaskProcessor Creates user task record
EXCLUSIVE_GATEWAY ExclusiveGatewayProcessor Evaluate conditions, take first true
PARALLEL_GATEWAY ParallelGatewayProcessor Fork: activate all outgoing. Join: count incoming.
SUB_PROCESS SubProcessProcessor Container with scoped variables
CALL_ACTIVITY CallActivityProcessor Creates child process instance
START_EVENT StartEventProcessor Activates first flow node
END_EVENT EndEventProcessor Completes flow scope
BOUNDARY_EVENT BoundaryEventProcessor Timer + error only
INTERMEDIATE_CATCH IntermediateCatchProcessor Timer + message
SEQUENCE_FLOW SequenceFlowProcessor Activates target element

Behaviors (6 en vez de 12): - VariableBehavior — input/output mapping, scoping - JobBehavior — create/complete/fail jobs - IncidentBehavior — create/resolve incidents - TimerBehavior — schedule/cancel timers - MessageBehavior — create/correlate message subscriptions - StateBehavior — state transitions, child tracking

Lifecycle State Machine (mantener el de Camunda):

stateDiagram-v2
    [*] --> ACTIVATING
    ACTIVATING --> ACTIVATED
    ACTIVATED --> COMPLETING
    COMPLETING --> COMPLETED
    ACTIVATING --> TERMINATING
    ACTIVATED --> TERMINATING
    COMPLETING --> TERMINATING
    TERMINATING --> TERMINATED
    COMPLETED --> [*]
    TERMINATED --> [*]


Módulo 5: Job Worker System

Lección de Camunda

gRPC streaming + long polling, timeout, retries, backoff, whitelisted commands.

Implementación Simplificada

REST API (compatible con Camunda v2):

POST /v2/jobs/activation
  Body: { type, worker, timeout, maxJobsToActivate, fetchVariables, requestTimeout }
  Response: { jobs: [...] }
  Long polling: if no jobs available, hold connection up to requestTimeout

POST /v2/jobs/{key}/completion
  Body: { variables }

POST /v2/jobs/{key}/failure
  Body: { retries, retryBackOff, errorMessage }

POST /v2/jobs/{key}/error
  Body: { errorCode, errorMessage, variables }

Implementación: - Job activation: SELECT ... FROM jobs WHERE type = ? AND state = 'CREATED' AND tenant_id IN (?) LIMIT ? FOR UPDATE SKIP LOCKED - Long polling: PostgreSQL LISTEN/NOTIFY en channel jobs_{type} - Timeout checker: goroutine/thread periódico que marca jobs expirados como TIMED_OUT - Backoff: WHERE backoff_until <= now() en el activation query


Módulo 6: Timer System

Lección de Camunda

DueDateCheckScheduler demand-driven con yielding. 100ms resolución.

Implementación Simplificada

Timer Scheduler: un goroutine/thread que: 1. SELECT MIN(due_date) FROM timers WHERE state = 'ACTIVE' 2. Sleep hasta ese timestamp (o max 1 segundo) 3. SELECT ... FROM timers WHERE due_date <= now() AND state = 'ACTIVE' ORDER BY due_date LIMIT 100 4. Para cada timer: generar command (TRIGGER), actualizar timer (si repetitivo, calcular next due date) 5. Yield después de 50 timers para no bloquear el event loop

Cron support: library cron del lenguaje elegido para parsear expresiones cron.


Módulo 7: Message Correlation

Lección de Camunda

Cross-partition routing, dedup por messageId, TTL, subscription matching.

Implementación Simplificada (single-partition)

-- Publish message
INSERT INTO messages (key, name, correlation_key, ...) ...
  ON CONFLICT (name, correlation_key, message_id) DO NOTHING;  -- dedup

-- Try correlate immediately
SELECT ms.* FROM message_subscriptions ms
WHERE ms.message_name = ? AND ms.correlation_key = ? AND ms.state = 'ACTIVE'
LIMIT 1 FOR UPDATE;

-- If match: activate the waiting element, mark subscription as correlated
-- If no match: message waits in buffer until TTL expires or subscription arrives

-- TTL cleanup
DELETE FROM messages WHERE deadline < now();

Módulo 8: REST API (Compatible con Camunda v2)

Endpoints Principales

# Process definitions
POST   /v2/process-definitions/search
GET    /v2/process-definitions/{key}
GET    /v2/process-definitions/{key}/xml
DELETE /v2/process-definitions/{key}

# Deployments
POST   /v2/deployments               (multipart: BPMN files)

# Process instances
POST   /v2/process-instances/creation
POST   /v2/process-instances/search
GET    /v2/process-instances/{key}
DELETE /v2/process-instances/{key}    (cancel)

# Jobs
POST   /v2/jobs/activation
POST   /v2/jobs/{key}/completion
POST   /v2/jobs/{key}/failure
POST   /v2/jobs/{key}/error

# User tasks
POST   /v2/user-tasks/search
GET    /v2/user-tasks/{key}
POST   /v2/user-tasks/{key}/assignment
DELETE /v2/user-tasks/{key}/assignee
POST   /v2/user-tasks/{key}/completion
PATCH  /v2/user-tasks/{key}

# Messages
POST   /v2/messages/publication

# Incidents
POST   /v2/incidents/search
POST   /v2/incidents/{key}/resolution

# Variables
POST   /v2/process-instances/{key}/variables
POST   /v2/variables/search

# Topology (simplified)
GET    /v2/topology

Módulo 9: Monitoring (Simplificado)

En vez de Operate + Tasklist

Opción A — API-only: todas las queries vía REST API. UI como proyecto separado.

Opción B — Dashboard simple: - Single-page app con tabla de process instances (filtro por state, process ID) - Drill-down: ver flow nodes, variables, incidents de una instancia - BPMN diagram viewer con highlighting del estado actual (usar bpmn-js) - Task list integrada

Queries SQL directas reemplazan a Elasticsearch:

-- Process instances with incidents
SELECT pi.*, COUNT(i.key) as incident_count
FROM process_instances pi
LEFT JOIN incidents i ON i.process_instance_key = pi.key AND i.state = 'ACTIVE'
WHERE pi.state = 'ACTIVE'
GROUP BY pi.key;

-- Active tasks
SELECT j.* FROM jobs j
WHERE j.state IN ('CREATED', 'ACTIVATED')
AND j.type = 'user-task'
ORDER BY j.created_at DESC;


Plan de Implementación (Fases)

Fase 1 — Core Engine (2-3 semanas con Claude Code)

  1. Schema PostgreSQL + migraciones
  2. BPMN parser (XML → modelo in-memory)
  3. Key generator
  4. Command processor (single-threaded loop)
  5. Element processors: Process, StartEvent, EndEvent, ServiceTask, SequenceFlow, ExclusiveGateway
  6. Variable store + scoping
  7. Job system: create, activate, complete, fail
  8. REST API: deploy, create instance, jobs

Fase 2 — Resilience (1-2 semanas)

  1. Timer system (boundary + intermediate catch)
  2. Incident management
  3. Job retries + backoff
  4. Message correlation
  5. Parallel gateway (fork + join)
  6. Sub-process + call activity

Fase 3 — Completeness (1-2 semanas)

  1. User tasks (assignment, completion, forms API)
  2. Error events (throw/catch)
  3. Expression language (CEL o FEEL subset)
  4. Multi-instance (parallel)
  5. Boundary events (timer, error)

Fase 4 — Production Readiness (1-2 semanas)

  1. Authentication (API keys + OIDC)
  2. Basic authorization (roles)
  3. Multi-tenancy (tenant_id filtering)
  4. Monitoring dashboard
  5. Metrics (OpenTelemetry)
  6. Backup/restore

Fase 5 — Scale (cuando se necesite)

  1. Clustering (PostgreSQL streaming replication o Raft)
  2. Partitioning
  3. Exporter system (para analytics)
  4. Advanced BPMN elements

Problemas No-Obvios que Camunda Ya Resolvió

Estos son los problemas que encontrarás durante la implementación y que Camunda ya tiene solución:

  1. Parallel gateway join con terminación parcial: cuando una rama se termina (por boundary event o error) antes de llegar al join. Camunda maneja esto con counters en el event scope — necesitas decrementar el expected count.

  2. Variable scope shadowing en sub-processes: variables con el mismo nombre en scopes anidados. El scope más interno gana para lectura, pero ¿dónde escribir en job completion? Camunda: al primer scope donde existe, o al root si es nueva.

  3. Timer rescheduling drift: si un timer repetitivo se ancla al momento de disparo, acumula drift. Camunda lo ancla a la due date anterior.

  4. Message dedup con TTL: un mensaje puede expirar entre el publish y la correlación. Necesitas un check de deadline en el momento de correlación, no solo en el cleanup.

  5. Incident resolution re-processing: al resolver un incident, necesitas re-ejecutar EXACTAMENTE el comando que falló. Camunda reconstruye el comando desde el estado del element instance.

  6. Process deployment duplicate detection: si el mismo BPMN se deploya dos veces, no crear versión nueva. Camunda usa checksum del contenido.

  7. Job timeout race condition: un worker completa un job justo cuando el timeout lo marca como TIMED_OUT. Necesitas FOR UPDATE SKIP LOCKED o similar.

  8. Cross-scope compensation: compensation handlers en sub-processes necesitan acceso a variables del scope donde se creó la subscription, no del scope actual.

  9. Event sub-process interruption: un interrupting event sub-process debe terminar TODOS los elementos activos en el parent scope antes de activarse. Orden de terminación importa.

  10. Call activity variable isolation: variables del parent NO se propagan automáticamente al child. Solo las que están en input mappings. Esto es intencional para encapsulación.