Self Review Critical Gaps

A critical meta-analysis of the wiki. If I were a staff engineer assigned to build this MVP, what questions would I ask? Where is the spec incomplete? This page lists those questions, gives brief answers, and identifies the 6 gaps that need a separate deep dive: backpressure without gRPC, process definition caching, timer recovery without snapshots, FEEL expression strategy, command log compaction, and API serialization with a single-threaded engine.

The critical questions

If you handed me this wiki and said "build it", these are the questions I would ask IMMEDIATELY:

🔴 Critical gaps (deep dive on a separate page)

1. How does backpressure work WITHOUT gRPC streaming?

Camunda uses StabilizingAIMD plus native gRPC flow control. The MVP uses REST. How do I limit requests when the engine saturates? Without this, a single client can take down the system.

concepts/backpressure-rest-strategy
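
As a rough illustration of what request limiting over REST could look like, here is a minimal sketch of a plain additive-increase / multiplicative-decrease (AIMD) limit on in-flight commands. It is not Camunda's StabilizingAIMD and not necessarily what the dedicated page settles on. The class name, the default values, the latency target, and the choice of answering 429 when the limit is reached are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class AimdLimiter:
    """Additive-increase / multiplicative-decrease cap on in-flight commands."""

    limit: int = 100               # current cap (assumed starting value)
    min_limit: int = 10
    max_limit: int = 1000
    backoff: float = 0.9           # multiplicative decrease factor (assumed)
    target_latency_s: float = 0.2  # latency above this counts as saturation (assumed)
    in_flight: int = 0

    def try_acquire(self) -> bool:
        """Admit a request, or refuse so the handler can answer 429."""
        if self.in_flight >= self.limit:
            return False
        self.in_flight += 1
        return True

    def release(self, latency_s: float) -> None:
        """Record a finished command and adapt the cap to the observed latency."""
        self.in_flight -= 1
        if latency_s > self.target_latency_s:
            self.limit = max(self.min_limit, int(self.limit * self.backoff))
        else:
            self.limit = min(self.max_limit, self.limit + 1)


def handle_request(limiter: AimdLimiter) -> int:
    """Return the HTTP status a REST handler would send at admission time."""
    return 202 if limiter.try_acquire() else 429
```

The design choice being illustrated is that, without gRPC flow control, the server has to refuse work explicitly and let clients retry later.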

2. How is the parsed BPMN cached?

Every CreateInstance needs the parsed BPMN (ExecutableProcess). Re-parsing the XML each time adds ~10 ms of overhead per request. Camunda has a process_cache column family. What does the MVP do?

concepts/process-definition-cache

3. How do timers survive a restart?

The engine restarts and the in-memory timers are lost. Camunda relies on RocksDB snapshots. The MVP has no snapshots, so how does it recover pending timers at startup?

concepts/timer-recovery-postgres

4. What do I do about FEEL expressions?

BPMN uses FEEL for conditions: =amount > 100 and customer.tier = "premium" (FEEL tests equality with a single =). Camunda uses feel-scala, a large and complex library. Build, reuse, or support a subset? This is a major decision.

analysis/feel-expressions-strategy

5. How do I keep the command_log from growing without bound?

1M instances × 50 events × 1 KB = 50 GB of events alone. Without explicit compaction this becomes catastrophic. Camunda compacts after snapshots. What is the MVP's strategy?

concepts/command-log-compaction
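
The sizing estimate above can be reproduced, and re-run with other assumptions, using a small helper. The function name is hypothetical. It counts raw event payload only and ignores Postgres row and index overhead, so real storage would be larger.

```python
def log_size_gb(instances: int, events_per_instance: int, event_bytes: int) -> float:
    """Raw event payload size in decimal gigabytes (1 GB = 10**9 bytes)."""
    return instances * events_per_instance * event_bytes / 10**9


# The figures from the text: 1M instances, 50 events each, 1 KB per event.
print(log_size_gb(1_000_000, 50, 1_000))  # 50.0

# Same workload if "1 KB" is read as 1024 bytes.
print(log_size_gb(1_000_000, 50, 1_024))  # 51.2
```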

6. How do I serialize a multi-threaded API onto a single-threaded engine?

The REST API receives 100 concurrent requests. The engine processes them on a single thread. An in-memory queue? A Postgres advisory lock? A worker pool? How is ordering guaranteed?

concepts/api-engine-serialization
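
To make one of the options listed above concrete, here is a minimal sketch of the in-memory queue variant: any number of API threads enqueue commands, and a single engine thread applies them in FIFO order. This illustrates only that option, not the decision of the dedicated page. The class name, the `applied` list standing in for the command log, and the `max_pending` default are assumptions.

```python
import queue
import threading
from concurrent.futures import Future

_STOP = object()  # sentinel that shuts the engine loop down


class SingleThreadedEngine:
    """Many API threads enqueue commands; one thread applies them in order."""

    def __init__(self, max_pending: int = 1000):
        self._commands = queue.Queue(maxsize=max_pending)
        self.applied = []  # stands in for the command log
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def submit(self, command: str) -> Future:
        """Called from any API thread; raises queue.Full when saturated."""
        done = Future()
        self._commands.put_nowait((command, done))
        return done

    def _run(self) -> None:
        while True:
            item = self._commands.get()
            if item is _STOP:
                return
            command, done = item
            self.applied.append(command)        # only this thread mutates state
            done.set_result(len(self.applied))  # position in the log

    def stop(self) -> None:
        """Drain everything queued so far, then stop the engine thread."""
        self._commands.put(_STOP)
        self._thread.join()


engine = SingleThreadedEngine()
futures = [engine.submit(f"CREATE:{i}") for i in range(3)]
positions = [f.result(timeout=1) for f in futures]
engine.stop()
```

Ordering here is simply the order in which commands enter the queue. The bounded queue also gives the API layer a natural signal (`queue.Full`) for rejecting work when the engine falls behind.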

🟡 Secondary gaps (answered briefly below)

  7. Multi-instance loops?
  8. Subprocesses + call activities?
  9. Error boundary events flow end-to-end?
  10. Postgres schema migration?
  11. Performance testing setup?
  12. Recommended code organization?
  13. Sync vs async API (createWithAwaitingResult)?
  14. How do I handle the "worker takes longer than the timeout" case?
  15. What happens with invalid BPMN?
  16. Quantitative KPIs to validate the MVP?

Brief answers to secondary gaps

7. Multi-instance loops

BPMN supports multi-instance activities (sequential or parallel): an element runs N times over a collection.

MVP approach (Phase 1):

  • Support parallel multi-instance on service tasks (the most common case)
  • Schema: element_instances with parent_scope_key + loop_index
  • Each iteration is a separate element_instance under the multi-instance body
  • The completion condition is evaluated after each child completes

-- Multi-instance body
INSERT INTO element_instances (
    element_instance_key, parent_scope_key, element_id, 
    element_type, state, multi_instance_completed, multi_instance_total
) VALUES (..., 'MULTI_INSTANCE_BODY', 'ACTIVATING', 0, 10);

-- Each child iteration
INSERT INTO element_instances (
    element_instance_key, parent_scope_key, element_id,
    element_type, state, loop_index, loop_variable_value
) VALUES (..., $body_key, 'send-email', 'SERVICE_TASK', 'ACTIVE', 0, '{"email":"a@x.com"}');

Sequential multi-instance: deferred to Phase 2 (more complex, less used).
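
The parallel approach above can be sketched in memory as follows. This is a simplified model under stated assumptions, not the engine's implementation: the class and method names are hypothetical, the completion condition is a plain Python callable standing in for a FEEL expression, and children still active when the body completes early are marked TERMINATED, as BPMN prescribes for remaining instances.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class MultiInstanceBody:
    total: int
    completed: int = 0
    state: str = "ACTIVE"
    # Hypothetical stand-in for a FEEL completion condition.
    completion_condition: Optional[Callable[["MultiInstanceBody"], bool]] = None
    children: dict = field(default_factory=dict)  # loop_index -> child state

    def activate_children(self, collection: list) -> None:
        """Parallel mode: one child element instance per collection item."""
        for loop_index, _item in enumerate(collection):
            self.children[loop_index] = "ACTIVE"

    def on_child_completed(self, loop_index: int) -> None:
        if self.state != "ACTIVE" or self.children.get(loop_index) != "ACTIVE":
            return  # ignore late or duplicate completions
        self.children[loop_index] = "COMPLETED"
        self.completed += 1
        early_exit = bool(self.completion_condition and self.completion_condition(self))
        if early_exit or self.completed == self.total:
            # Children still running when the body finishes early are terminated.
            for idx, child_state in self.children.items():
                if child_state == "ACTIVE":
                    self.children[idx] = "TERMINATED"
            self.state = "COMPLETED"


body = MultiInstanceBody(total=3, completion_condition=lambda b: b.completed >= 2)
body.activate_children(["a@x.com", "b@x.com", "c@x.com"])
body.on_child_completed(0)
body.on_child_completed(2)  # condition holds: body completes, child 1 is terminated
```

The `completed` and `total` fields mirror the multi_instance_completed and multi_instance_total columns in the schema above.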

8. Subprocesses + call activities

Embedded sub-process:

  • Element instance with element_type = 'SUB_PROCESS'
  • Children with parent_scope_key = sub_process_key
  • Variables are local to the sub-process scope

Call activity:

  • Creates a new process_instance with parent_process_instance_key = calling_pi
  • Input/output mappings for variables
  • When the child completes, the parent resumes

The schema already supports this via parent_scope_key and parent_process_instance_key. The core design does not change.

9. Error boundary events flow

End-to-end:

flowchart TD
    Worker[Worker calls ThrowError errorCode=GOOD_UNAVAILABLE]
    Worker --> Cmd[Engine receives JOB:THROW_ERROR command]
    Cmd --> Find[Find element_instance of the job's service task]
    Find --> Lookup[Look up boundary events on this task]
    Lookup --> Match{Match errorCode?}
    Match -->|Match| Terminate[Terminate service task element]
    Terminate --> Activate[Activate boundary event's outgoing flow]
    Match -->|No match| Propagate[Propagate up scope chain]
    Propagate --> Root{Reaches process root?}
    Root -->|Yes| Term[TERMINATE process instance<br/>Emit ERROR_THROWN event, no incident]
    Root -->|No| Lookup

Implementation: each element processor handles ERROR signal propagation. To be documented in the concept page error-propagation.
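
The lookup loop in the flowchart can be sketched as a walk up the scope chain. This is a simplified model with hypothetical names (`Scope`, `resolve_error`, the element and boundary ids). It matches on exact errorCode only and does not model catch-all boundary events.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Scope:
    element_id: str
    parent: Optional["Scope"] = None
    # errorCode -> id of the error boundary event attached to this element
    error_boundaries: dict = field(default_factory=dict)


def resolve_error(origin: Scope, error_code: str) -> tuple:
    """Walk up the scope chain until a boundary event catches error_code.

    Returns (action, boundary_event_id, interrupted_element_ids).
    """
    interrupted = []
    scope = origin
    while scope is not None:
        interrupted.append(scope.element_id)  # this scope is interrupted either way
        boundary = scope.error_boundaries.get(error_code)
        if boundary is not None:
            return ("ACTIVATE_BOUNDARY", boundary, interrupted)
        scope = scope.parent  # no match: propagate to the enclosing scope
    return ("TERMINATE_PROCESS", None, interrupted)


process = Scope("order-process")
sub = Scope("payment-subprocess", parent=process,
            error_boundaries={"PAYMENT_FAILED": "boundary-payment"})
task = Scope("charge-card", parent=sub,
             error_boundaries={"GOOD_UNAVAILABLE": "boundary-stock"})

caught_on_task = resolve_error(task, "GOOD_UNAVAILABLE")
caught_on_subprocess = resolve_error(task, "PAYMENT_FAILED")
uncaught = resolve_error(task, "UNKNOWN")
```

An error caught on the task interrupts only the task. An error caught one level up interrupts the task and the sub-process. An uncaught error reaches the root and terminates the process instance.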

10. Schema migration

Standard Postgres migration tooling:

  • Flyway or Liquibase (Java-friendly)
  • migrate or dbmate (Go-friendly)
  • alembic (Python)
  • node-pg-migrate (TypeScript)

Each migration is a numbered SQL file. On startup the engine checks the current migration version and applies any pending migrations.

db/migrations/
├── V001__initial_schema.sql
├── V002__add_business_id_index.sql
├── V003__add_tenant_isolation_rls.sql
└── ...
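
The startup check described above can be sketched as a pure function that decides which files are pending. This is only the version-ordering logic, under the assumption that files follow the V<number>__<name>.sql scheme shown in the tree. Real tools such as Flyway also keep a history table and checksums, which this sketch omits. The function name is hypothetical.

```python
import re

MIGRATION_NAME = re.compile(r"^V(\d+)__(\w+)\.sql$")


def pending_migrations(filenames: list, applied_version: int) -> list:
    """Return (version, filename) pairs newer than applied_version, in order."""
    found = []
    for name in filenames:
        match = MIGRATION_NAME.match(name)
        if match is None:
            continue  # ignore files that do not follow the naming scheme
        found.append((int(match.group(1)), name))
    found.sort()
    versions = [version for version, _ in found]
    if len(versions) != len(set(versions)):
        raise ValueError("duplicate migration version")
    return [(version, name) for version, name in found if version > applied_version]


files = [
    "V002__add_business_id_index.sql",
    "V001__initial_schema.sql",
    "README.md",
    "V003__add_tenant_isolation_rls.sql",
]
todo = pending_migrations(files, applied_version=1)
```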

Important: old events in the log must remain replayable. If the event schema changes:

  • Keep old event types as legacy classes
  • The engine handles the old format during replay
  • Eventually migrate old events (offline tool)

Compatible with adrs/adr-019-replay-determinism-invariant.

11. Performance testing setup

mvp-load-tester/
├── scenarios/
│   ├── basic-throughput.yaml      (create + complete simple process)
│   ├── tsunami.yaml               (Intuit-style sustained load)
│   ├── incident-recovery.yaml     (incidents + resolution)
│   └── multi-tenant.yaml          (multiple tenants concurrent)
├── workers/
│   └── stub-worker.py             (fake worker, configurable latency)
└── grafana/
    └── load-test-dashboard.json

Tool options:

  • k6 (JavaScript, popular)
  • Locust (Python, distributed)
  • Gatling (Scala, sophisticated)

Recommended: k6, for its simplicity and good observability.

Validation targets per ADR-003:

  • 100 TPS sustained instance creation
  • 500-1000 TPS jobs
  • TP99 < 1s end-to-end
  • No degradation over 4 hours
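
A load-test run could be checked against the creation and latency targets above with a helper along these lines. The thresholds (100 TPS, TP99 under 1 s) come from the list above. The function names and the nearest-rank percentile definition are assumptions, and in practice k6 would report its own percentile figures.

```python
import math


def percentile(samples: list, pct: int) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% of samples <= it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def check_targets(latencies_s: list, created: int, duration_s: float) -> dict:
    """Compare one run against the creation-TPS and TP99 targets."""
    tps = created / duration_s
    tp99 = percentile(latencies_s, 99)
    return {
        "tps": tps,
        "tp99_s": tp99,
        "tps_ok": tps >= 100,   # 100 TPS sustained creation
        "tp99_ok": tp99 < 1.0,  # TP99 < 1s end-to-end
    }


# 100 samples: 98 fast requests, one at 0.5 s, one outlier at 2.0 s.
latencies = [0.1] * 98 + [0.5, 2.0]
report = check_targets(latencies, created=36_000, duration_s=300)
```

With 100 samples the 99th percentile is the 99th smallest value, so the single 2.0 s outlier does not break the TP99 target on its own.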

12. Recommended code organization

Repo structure:

workflow-engine/
├── core/                       # Engine logic
│   ├── command-log/
│   ├── event-log/
│   ├── stream-processor/       # Single-threaded actor
│   ├── bpmn-parser/
│   ├── element-processors/     # ~15-20 processors
│   ├── behaviors/              # ~6-8 behaviors
│   ├── state/                  # State tables access
│   └── schemas/                # Migrations + SQL
├── api/                        # REST API
│   ├── server/
│   ├── handlers/
│   ├── middleware/
│   └── openapi.yaml
├── inspector/                  # Process Inspector UI
├── tasklist/                   # Tasklist UI
├── cli/                        # Ops CLI
├── sdk/                        # Worker SDKs
│   ├── typescript/
│   ├── python/
│   └── go/
├── observability/              # OTel instrumentation
├── load-tester/                # Performance tests
├── docs/                       # User docs
├── docs/architecture/          # ADRs, blueprint, etc.
├── deploy/
│   ├── docker/
│   ├── kubernetes/
│   ├── helm/
│   └── grafana/                # Dashboards JSON
└── README.md

Recommended languages:

  • Go: best fit (concurrency, performance, deployment)
  • Rust: alternative if performance is critical
  • TypeScript/Bun: faster MVP iteration
  • Java/Kotlin: if the team comes from Camunda

My recommendation: Go for the engine + TypeScript for the web apps. A familiar split with mature tooling.

13. Sync vs async API

Camunda has CreateProcessInstance (async) and CreateProcessInstanceWithAwaitingResult (sync, waits for completion).

MVP:

  • Default async: immediate response with processInstanceKey
  • Optional sync mode via the Prefer header: Prefer: wait=30 (RFC 7240 defines wait as a whole number of seconds and respond-async as a valueless token, so a client that wants to wait simply omits respond-async)
  • The server holds the request until the process completes (or the wait expires)
  • Implementation: register a callback on element completion, SSE/long-poll

POST /v2/process-instances
Prefer: wait=30
{ "bpmnProcessId": "approve-order", "variables": {...} }

Returns either:

  • 200 OK with the final variables (completed in time)
  • 202 Accepted with processInstanceKey (wait expired, completion continues asynchronously)
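
The two possible responses can be sketched as follows. This is a minimal model, not a real HTTP handler: a threading.Event stands in for the completion callback, the function names are hypothetical, and the parser tolerates both wait=30 and wait=30s as an assumption about what clients might send.

```python
import re
import threading
from typing import Optional

# Matches a wait preference with an optional trailing "s" unit.
_WAIT = re.compile(r"(?:^|[,;\s])wait=(\d+)s?(?=$|[,;\s])")


def parse_wait_seconds(prefer_header: Optional[str]) -> Optional[int]:
    """Extract the wait preference in seconds, or None when absent."""
    if not prefer_header:
        return None
    match = _WAIT.search(prefer_header)
    return int(match.group(1)) if match else None


def create_instance(prefer_header: Optional[str], completed: threading.Event,
                    key: int, variables: dict) -> tuple:
    """Return (status, body) as a handler would after starting the instance."""
    wait_s = parse_wait_seconds(prefer_header)
    if wait_s is not None and completed.wait(timeout=wait_s):
        return 200, {"processInstanceKey": key, "variables": variables}
    return 202, {"processInstanceKey": key}


finished = threading.Event()
finished.set()  # the process already completed
still_running = threading.Event()

in_time = create_instance("wait=1", finished, 42, {"approved": True})
timed_out = create_instance("wait=0", still_running, 42, {})
async_default = create_instance(None, finished, 42, {})
```

Without a wait preference the handler never blocks, which keeps async as the default behavior.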

14. "Worker takes longer than the timeout"

A real race condition:

sequenceDiagram
    participant E as Engine
    participant A as Worker A
    participant B as Worker B

    Note over E,A: T=0: Job assigned to Worker A, timeout=30s
    A->>A: T=10: starts processing (slow API call)
    Note over E,A: T=30: Timeout expires
    E->>B: T=30: Re-assigns to Worker B
    B-->>E: T=35: Completes (success)
    A-->>E: T=45: Finally completes (DUPLICATE submission)

Mitigations:

  1. Worker idempotency (ADR-007): fundamental
  2. Job versioning: each activation increments a version, the completion includes that version, and the engine rejects stale ones
  3. Optimistic completion: the engine accepts the first completion and ignores duplicates
  4. Worker self-check: before submitting a result, ask "am I still the owner?"

Schema:

ALTER TABLE jobs ADD COLUMN activation_count BIGINT DEFAULT 0;

-- On activation
UPDATE jobs SET activation_count = activation_count + 1, ... 

-- On completion
UPDATE jobs SET state = 'COMPLETED' 
WHERE job_key = $1 AND activation_count = $2  -- match version

The stale completion's activation_count no longer matches, so the UPDATE affects zero rows and the engine rejects it: no duplicate processing.
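
The race from the sequence diagram can be replayed end to end with an in-memory SQLite database standing in for Postgres. This is a sketch under stated assumptions: the state names ACTIVATABLE and ACTIVATED, the extra state guard on completion, and the helper function names are illustrative and not taken from the schema above.

```python
import sqlite3

db = sqlite3.connect(":memory:")  # stand-in for Postgres
db.execute(
    "CREATE TABLE jobs (job_key INTEGER PRIMARY KEY, state TEXT, "
    "activation_count INTEGER DEFAULT 0)"
)
db.execute("INSERT INTO jobs (job_key, state) VALUES (1, 'ACTIVATABLE')")


def activate(job_key: int) -> int:
    """Assign (or re-assign) the job to a worker and return the new version."""
    db.execute(
        "UPDATE jobs SET state = 'ACTIVATED', "
        "activation_count = activation_count + 1 "
        "WHERE job_key = ? AND state != 'COMPLETED'",
        (job_key,),
    )
    row = db.execute(
        "SELECT activation_count FROM jobs WHERE job_key = ?", (job_key,)
    ).fetchone()
    return row[0]


def complete(job_key: int, version: int) -> bool:
    """Complete only if the caller still holds the latest activation."""
    cursor = db.execute(
        "UPDATE jobs SET state = 'COMPLETED' "
        "WHERE job_key = ? AND activation_count = ? AND state = 'ACTIVATED'",
        (job_key, version),
    )
    return cursor.rowcount == 1  # zero rows updated means a stale version


version_a = activate(1)              # T=0: assigned to Worker A
version_b = activate(1)              # T=30: timeout, re-assigned to Worker B
b_accepted = complete(1, version_b)  # T=35: Worker B completes
a_accepted = complete(1, version_a)  # T=45: Worker A arrives with a stale version
```

Worker B's completion is accepted and Worker A's is rejected, because A still carries the version from the first activation.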

15. Invalid BPMN

Validation steps at deployment:

  1. XSD validation against the BPMN 2.0 schema (well-formed XML + structure)
  2. Semantic validation:
     • Process has a start event
     • Sequence flows reference valid elements
     • Exclusive gateways have conditions
     • Service tasks have a job type
  3. MVP-specific validation:
     • Only supported element types
     • Only supported event types
     • FEEL expressions parse correctly
  4. Reject with detailed errors:

400 Bad Request
{
  "errors": [
    { "element": "task_check", "message": "Service task missing zeebe:taskDefinition" },
    { "element": "gw_decision", "message": "Exclusive gateway has no default flow" }
  ]
}

Validation library: bpmn-js can also validate, and is reusable.
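
A few of the semantic checks from step 2 can be sketched with the standard library's XML parser. This is a partial illustration, not a complete validator: it skips XSD validation and FEEL parsing, the function name is hypothetical, and flagging an exclusive gateway only when it has more than one outgoing flow and no default is an assumption about how the rule would be applied.

```python
import xml.etree.ElementTree as ET

BPMN = "{http://www.omg.org/spec/BPMN/20100524/MODEL}"
ZEEBE = "{http://camunda.org/schema/zeebe/1.0}"


def validate(bpmn_xml: str) -> list:
    """Return a list of {"element", "message"} dicts; an empty list means valid."""
    errors = []

    def report(element_id, message):
        errors.append({"element": element_id, "message": message})

    root = ET.fromstring(bpmn_xml)
    for process in root.iter(BPMN + "process"):
        ids = {el.get("id") for el in process.iter() if el.get("id")}
        flows = list(process.iter(BPMN + "sequenceFlow"))
        if process.find(BPMN + "startEvent") is None:
            report(process.get("id"), "Process has no start event")
        for flow in flows:
            for attr in ("sourceRef", "targetRef"):
                if flow.get(attr) not in ids:
                    report(flow.get("id"), f"Sequence flow {attr} references unknown element")
        for task in process.iter(BPMN + "serviceTask"):
            if task.find(f"{BPMN}extensionElements/{ZEEBE}taskDefinition") is None:
                report(task.get("id"), "Service task missing zeebe:taskDefinition")
        for gateway in process.iter(BPMN + "exclusiveGateway"):
            outgoing = [f for f in flows if f.get("sourceRef") == gateway.get("id")]
            if len(outgoing) > 1 and gateway.get("default") is None:
                report(gateway.get("id"), "Exclusive gateway has no default flow")
    return errors


broken = """<bpmn:definitions
    xmlns:bpmn="http://www.omg.org/spec/BPMN/20100524/MODEL"
    xmlns:zeebe="http://camunda.org/schema/zeebe/1.0">
  <bpmn:process id="p1">
    <bpmn:serviceTask id="task_check"/>
    <bpmn:exclusiveGateway id="gw_decision"/>
    <bpmn:sequenceFlow id="f1" sourceRef="gw_decision" targetRef="task_check"/>
    <bpmn:sequenceFlow id="f2" sourceRef="gw_decision" targetRef="missing"/>
  </bpmn:process>
</bpmn:definitions>"""
problems = validate(broken)
```

Collecting every error before rejecting, instead of failing on the first one, is what makes the detailed 400 response above possible.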

16. Quantitative KPIs for the MVP

To validate success:

| Metric | Target (Phase 0) | How to measure |
|---|---|---|
| Functional: BPMN coverage | Service task + User task + Gateway + Timer + Message | Property-based tests pass |
| Performance: TPS sustained creation | 100 | k6 load test, 4h |
| Performance: Job TPS | 500 | k6 worker simulation |
| Performance: TP99 latency | < 1s | OTel metrics |
| Reliability: Uptime | 99% | Health check pings |
| Reliability: Data loss | 0 | Replay determinism test |
| Storage: Per-instance | < 50 KB avg | Postgres size queries |
| Compatibility: Camunda BPMN | 80% of test models work | Test corpus |
| Developer UX: Worker SDK | < 10 LOC for "hello world" worker | Code review |
| Ops UX: Setup time | < 30 min from git clone to running | Onboarding doc |

Self-review conclusion

Gaps that require a deep dive (6)

  1. ✅ Backpressure without gRPC → concepts/backpressure-rest-strategy
  2. ✅ Process definition cache → concepts/process-definition-cache
  3. ✅ Timer recovery → concepts/timer-recovery-postgres
  4. ✅ FEEL expressions → analysis/feel-expressions-strategy
  5. ✅ Log compaction → concepts/command-log-compaction
  6. ✅ API serialization → concepts/api-engine-serialization

Gaps answered briefly (on this page)

7-16: documented above with a summary strategy.

Potential future iterations

  • Detailed multi-region deployment patterns (Phase 5 spec)
  • Disaster recovery (DR) runbook
  • Security threat model (STRIDE analysis)
  • Performance tuning guide (Postgres + engine)
  • Camunda 8-specific migration guide

These are beyond MVP scope and are deferred until needed.

Self-critique of the existing ADRs

Re-reading the 21 ADRs, I identify these potential issues:

  1. ADR-002 (PostgreSQL only): Have I validated that 200 TPS is achievable? A concrete benchmark of the proposed schema is missing.

  2. ADR-006 (Single-threaded): How does throughput scale within a single thread? Profiling is needed.

  3. ADR-010 (Hybrid monitoring): the cost projection assumes a self-hosted APM. With a SaaS APM, costs scale differently.

  4. ADR-018 (Build Tasklist): the scope is still ~8K LOC. Is there a way to reduce it further?

These are refinements, not fundamental flaws. Document them in future revisions of each ADR if necessary.

Conclusion

The wiki had real gaps that an implementer would hit on day 1. This self-review identified 6 critical and 10 secondary ones. The 6 critical gaps require dedicated pages, which follow in upcoming commits.

This illustrates the benefit of the Karpathy methodology: continuously reviewing and deepening the material improves the wiki's quality for real use.