Self Review Critical Gaps
Critical meta-analysis of the wiki. If I were a staff engineer assigned to build this MVP, what questions would I ask? Where is the spec incomplete? This page lists those questions, answers them briefly, and identifies the 6 gaps that need a separate deep dive. It covers: backpressure without gRPC, process definition caching, timer recovery without snapshots, FEEL expression strategy, command log compaction, and API serialization against a single-threaded engine.
The critical questions
If you hand me this wiki and say "build it", these are the questions I would ask IMMEDIATELY:
🔴 Critical gaps (deep dive on a separate page)
1. How does backpressure work WITHOUT gRPC streaming?
Camunda uses StabilizingAIMD plus native gRPC flow control. The MVP uses REST. How do I limit requests when the engine saturates? Without an answer, a single client can take the system down.
→ concepts/backpressure-rest-strategy
2. How is the parsed BPMN cached?
Every CreateInstance needs the parsed BPMN (ExecutableProcess). Re-parsing the XML each time adds roughly 10 ms of overhead per request. Camunda has a process_cache column family. What does the MVP do?
→ concepts/process-definition-cache
3. How do timers survive a restart?
The engine restarts and its in-memory timers are lost. Camunda uses RocksDB snapshots. The MVP has no snapshots, so how does it recover pending timers at startup?
→ concepts/timer-recovery-postgres
4. What do I do about FEEL expressions?
BPMN uses FEEL for conditions: =amount > 100 and customer.tier = "premium" (FEEL compares with a single =, not ==). Camunda uses feel-scala, a large and complex library. Build, reuse, or subset? This is a major decision.
→ analysis/feel-expressions-strategy
5. How do I keep the command_log from growing without bound?
1M instances × 50 events × 1 KB = 50 GB of events alone. Without explicit compaction this becomes catastrophic. Camunda compacts its log after snapshots. What is the MVP's strategy?
→ concepts/command-log-compaction
6. How do I serialize a multi-threaded API onto a single-threaded engine?
The REST API receives 100 concurrent requests. The engine processes them single-threaded. An in-memory queue? A Postgres advisory lock? A worker pool? How is ordering guaranteed?
→ concepts/api-engine-serialization
🟡 Secondary gaps (answered briefly below)
- Multi-instance loops?
- Subprocesses + call activities?
- Error boundary event flow end-to-end?
- Postgres schema migration?
- Performance testing setup?
- Recommended code organization?
- Sync vs async API (createWithAwaitingResult)?
- How do I handle the "worker takes longer than the timeout" case?
- What happens with invalid BPMN?
- Quantitative KPIs to validate the MVP?
Brief answers to the secondary gaps
7. Multi-instance loops
BPMN supports multi-instance activities (sequential or parallel): an element is executed N times, once per item of a collection.
MVP approach (Phase 1):
- Support parallel multi-instance on service tasks (the most common case)
- Schema: element_instances with parent_scope_key + loop_index
- Each iteration is a separate element_instance under the multi-instance body
- The completion condition is evaluated after each child completes
```sql
-- Multi-instance body
INSERT INTO element_instances (
    element_instance_key, parent_scope_key, element_id,
    element_type, state, multi_instance_completed, multi_instance_total
) VALUES (..., 'MULTI_INSTANCE_BODY', 'ACTIVATING', 0, 10);

-- Each child iteration
INSERT INTO element_instances (
    element_instance_key, parent_scope_key, element_id,
    element_type, state, loop_index, loop_variable_value
) VALUES (..., $body_key, 'send-email', 'SERVICE_TASK', 'ACTIVE', 0, '{"email":"a@x.com"}');
```
Sequential multi-instance: deferred to Phase 2 (more complex, less used).
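The parallel multi-instance bookkeeping described above can be sketched in a few lines. This is a minimal in-memory illustration under stated assumptions, not the engine's implementation: the class name `MultiInstanceBody`, the method `on_child_completed`, and the optional completion-condition callable are hypothetical. Only `total`, `completed`, and `loop_index` mirror the schema columns shown in this section.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class MultiInstanceBody:
    """In-memory stand-in for the MULTI_INSTANCE_BODY row (hypothetical class)."""
    total: int                      # mirrors multi_instance_total
    completed: int = 0              # mirrors multi_instance_completed
    state: str = "ACTIVE"
    # Optional completion condition, evaluated after every child completion.
    completion_condition: Optional[Callable[["MultiInstanceBody"], bool]] = None
    finished_indexes: set = field(default_factory=set)

    def on_child_completed(self, loop_index: int) -> bool:
        """Record one child completion; return True once the body is done."""
        if self.state != "ACTIVE" or loop_index in self.finished_indexes:
            # Late or duplicate child completions do not change the counters.
            return self.state == "COMPLETED"
        self.finished_indexes.add(loop_index)
        self.completed += 1
        all_done = self.completed >= self.total
        condition_met = bool(
            self.completion_condition and self.completion_condition(self)
        )
        if all_done or condition_met:
            self.state = "COMPLETED"
        return self.state == "COMPLETED"


# Without a condition, the body completes when every child has completed.
body = MultiInstanceBody(total=3)
results = [body.on_child_completed(i) for i in range(3)]

# With a condition, the body can complete before all children do.
early = MultiInstanceBody(total=10, completion_condition=lambda b: b.completed >= 2)
early_results = [early.on_child_completed(i) for i in range(3)]
```

In a real engine the counter update and the state transition would happen in the same transaction as the child's completion, so the single-threaded processor never observes a half-updated body.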
8. Subprocesses + call activities
Embedded sub-process:
- Element instance with element_type = 'SUB_PROCESS'
- Children with parent_scope_key = sub_process_key
- Variables local to the sub-process scope
Call activity:
- Creates a new process_instance with parent_process_instance_key = calling_pi
- Input/output mappings for variables
- When the child completes, the parent resumes
The schema already supports this via parent_scope_key and parent_process_instance_key. The core design does not change.
9. Error boundary events flow
End-to-end:
```mermaid
flowchart TD
    Worker[Worker calls ThrowError errorCode=GOOD_UNAVAILABLE]
    Worker --> Cmd[Engine receives JOB:THROW_ERROR command]
    Cmd --> Find[Find element_instance of the job's service task]
    Find --> Lookup[Look up boundary events on this task]
    Lookup --> Match{Match errorCode?}
    Match -->|Match| Terminate[Terminate service task element]
    Terminate --> Activate[Activate boundary event's outgoing flow]
    Match -->|No match| Propagate[Propagate up scope chain]
    Propagate --> Root{Reaches process root?}
    Root -->|Yes| Term[TERMINATE process instance<br/>Emit ERROR_THROWN event, no incident]
    Root -->|No| Lookup
```
Implementation: each element processor handles ERROR signal propagation. To be documented in the concept page error-propagation.
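The lookup-and-propagate loop in the flowchart can be sketched as a walk up the scope chain. This is a minimal sketch under stated assumptions: the `Scope` structure, the `propagate_error` function, and the element ids are hypothetical, and real boundary events would be read from the parsed process definition rather than stored on the instance.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple


@dataclass
class Scope:
    """One element instance in the scope chain (hypothetical structure)."""
    element_id: str
    parent: Optional["Scope"] = None
    # Maps an error code to the id of the boundary event attached to this element.
    boundary_events: Dict[str, str] = field(default_factory=dict)


def propagate_error(origin: Scope, error_code: str) -> Tuple[str, str]:
    """Walk from the failing element towards the process root.

    Returns ("CAUGHT", boundary_event_id) for the first matching boundary
    event, or ("TERMINATED", "ERROR_THROWN") if the root is passed unmatched.
    """
    scope: Optional[Scope] = origin
    while scope is not None:
        boundary = scope.boundary_events.get(error_code)
        if boundary is not None:
            return ("CAUGHT", boundary)
        scope = scope.parent  # no match here: propagate up the scope chain
    return ("TERMINATED", "ERROR_THROWN")


# Hypothetical process: root -> embedded sub-process -> service task.
root = Scope("order-process")
sub = Scope("fulfilment", parent=root,
            boundary_events={"GOOD_UNAVAILABLE": "catch-unavailable"})
task = Scope("reserve-goods", parent=sub)

caught = propagate_error(task, "GOOD_UNAVAILABLE")
uncaught = propagate_error(task, "PAYMENT_FAILED")
```

Here the error thrown on the service task is caught one level up, on the sub-process boundary, while an unknown error code reaches the root and terminates the instance.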
10. Schema migration
Standard Postgres migration tooling:
- Flyway or Liquibase (Java-friendly)
- golang-migrate or dbmate (Go-friendly)
- Alembic (Python)
- node-pg-migrate (TypeScript)
Each migration is a numbered SQL file. At startup the engine checks the current migration version and applies any pending migrations.
```
db/migrations/
├── V001__initial_schema.sql
├── V002__add_business_id_index.sql
├── V003__add_tenant_isolation_rls.sql
└── ...
```
Important: old events in the log must remain replayable. If the event schema changes:
- Keep old event types as legacy classes
- The engine handles the old format during replay
- Eventually migrate old events (offline tool)
This is consistent with adrs/adr-019-replay-determinism-invariant.
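One common way to keep old events replayable is to upcast them to the current shape as they are read from the log. The sketch below illustrates that idea only. Everything in it is an assumption made for illustration: the `version` field on events, the `JOB_CREATED` payload change, and the `upcaster`/`upcast` helpers are not defined anywhere in this wiki.

```python
from typing import Callable, Dict, Tuple

Event = dict
# (event type, version it upgrades from) -> function producing the next version
UPCASTERS: Dict[Tuple[str, int], Callable[[Event], Event]] = {}


def upcaster(event_type: str, from_version: int):
    """Register a function that upgrades one event type by one version."""
    def register(fn: Callable[[Event], Event]) -> Callable[[Event], Event]:
        UPCASTERS[(event_type, from_version)] = fn
        return fn
    return register


@upcaster("JOB_CREATED", 1)
def job_created_v1_to_v2(event: Event) -> Event:
    # Hypothetical change: v2 renamed "type" to "job_type" and added "retries".
    payload = dict(event["payload"])          # copy, never mutate the stored event
    payload["job_type"] = payload.pop("type")
    payload.setdefault("retries", 3)
    return {**event, "version": 2, "payload": payload}


def upcast(event: Event) -> Event:
    """Apply registered upcasters until the event is at its latest version."""
    while (event["type"], event["version"]) in UPCASTERS:
        event = UPCASTERS[(event["type"], event["version"])](event)
    return event


old_event = {"type": "JOB_CREATED", "version": 1, "payload": {"type": "send-email"}}
current = upcast(old_event)
```

Because upcasting is a pure function of the stored event, replay stays deterministic: the same log always produces the same upgraded events.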
11. Performance testing setup
```
mvp-load-tester/
├── scenarios/
│   ├── basic-throughput.yaml (create + complete simple process)
│   ├── tsunami.yaml (Intuit-style sustained load)
│   ├── incident-recovery.yaml (incidents + resolution)
│   └── multi-tenant.yaml (multiple tenants concurrent)
├── workers/
│   └── stub-worker.py (fake worker, configurable latency)
└── grafana/
    └── load-test-dashboard.json
```
Tool options:
- k6 (JavaScript, popular)
- Locust (Python, distributed)
- Gatling (Scala, sophisticated)
Recommended: k6, for its simplicity and good observability.
Validation targets per ADR-003:
- 100 TPS sustained creation
- 500-1000 TPS jobs
- TP99 < 1s end-to-end
- No degradation over 4 hours
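Checking a run against the creation-throughput and TP99 targets is simple arithmetic once the load tool has exported raw latencies. The sketch below shows one way to do it, assuming latencies in milliseconds and a nearest-rank percentile. The function names and default thresholds are illustrative, and in practice k6 can report these percentiles itself.

```python
def percentile(samples, pct: int):
    """Nearest-rank percentile; pct is an integer between 1 and 100."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # Integer ceiling of pct * n / 100, avoiding floating-point rounding.
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def check_targets(latencies_ms, instances_created, duration_s,
                  min_tps=100, max_tp99_ms=1000):
    """Compare one load-test run against the throughput and TP99 targets."""
    tps = instances_created / duration_s
    tp99 = percentile(latencies_ms, 99)
    return {
        "tps": tps,
        "tp99_ms": tp99,
        "passed": tps >= min_tps and tp99 < max_tp99_ms,
    }


# Hypothetical 4-hour run: 1,440,000 instances in 14,400 seconds.
good_run = check_targets([200] * 99 + [1500], 1_440_000, 14_400)
bad_run = check_targets([200] * 98 + [1500] * 2, 1_440_000, 14_400)
```

The two runs differ by a single slow sample, which is enough to move the nearest-rank TP99 across the 1 s threshold with only 100 samples. Real runs should use far more samples per window.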
12. Recommended code organization
Repo structure:
```
workflow-engine/
├── core/                      # Engine logic
│   ├── command-log/
│   ├── event-log/
│   ├── stream-processor/      # Single-threaded actor
│   ├── bpmn-parser/
│   ├── element-processors/    # ~15-20 processors
│   ├── behaviors/             # ~6-8 behaviors
│   ├── state/                 # State tables access
│   └── schemas/               # Migrations + SQL
├── api/                       # REST API
│   ├── server/
│   ├── handlers/
│   ├── middleware/
│   └── openapi.yaml
├── inspector/                 # Process Inspector UI
├── tasklist/                  # Tasklist UI
├── cli/                       # Ops CLI
├── sdk/                       # Worker SDKs
│   ├── typescript/
│   ├── python/
│   └── go/
├── observability/             # OTel instrumentation
├── load-tester/               # Performance tests
├── docs/                      # User docs
├── docs/architecture/         # ADRs, blueprint, etc.
├── deploy/
│   ├── docker/
│   ├── kubernetes/
│   ├── helm/
│   └── grafana/               # Dashboards JSON
└── README.md
```
Recommended languages:
- Go: best fit (concurrency, performance, deployment)
- Rust: an alternative if performance is critical
- TypeScript/Bun: faster MVP iteration
- Java/Kotlin: if the team comes from Camunda
My recommendation: Go for the engine and TypeScript for the web apps. It is a familiar split with mature tooling.
13. Sync vs async API
Camunda has CreateProcessInstance (async) and CreateProcessInstanceWithAwaitingResult (sync, waits for completion).
MVP:
- Default async: immediate response with processInstanceKey
- Optional sync via the Prefer header (RFC 7240): Prefer: wait=30. The wait value is a plain number of seconds, and respond-async is a valueless preference, so sync mode is requested by sending wait without respond-async
- The server holds the request until the process completes (or the wait times out)
- Implementation: register a callback on element completion; SSE or long-poll
```http
POST /v2/process-instances
Prefer: wait=30

{ "bpmnProcessId": "approve-order", "variables": {...} }
```
Returns either:
- 200 OK with the final variables (completed in time)
- 202 Accepted with processInstanceKey (timeout; completion continues asynchronously)
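The 200-versus-202 decision can be sketched with a completion event that the request handler waits on. This is a minimal sketch, assuming a threaded server and a hypothetical `PendingInstance` object that the engine signals on completion. The `max_wait` cap and the tolerance for a trailing `s` on the wait value are assumptions added for illustration, not part of RFC 7240.

```python
import threading
from typing import Optional, Tuple


def parse_wait_seconds(prefer_header: Optional[str], max_wait: int = 60) -> Optional[int]:
    """Extract the wait preference in seconds, or None if it is absent or invalid."""
    if not prefer_header:
        return None
    for token in prefer_header.split(","):
        name, _, value = token.strip().partition("=")
        value = value.strip().lower()
        if value.endswith("s"):
            value = value[:-1]  # tolerate "30s" as a convenience (assumption)
        if name.strip().lower() == "wait" and value.isdigit():
            return min(int(value), max_wait)  # cap how long a request may be held
    return None


class PendingInstance:
    """Hypothetical handle the engine signals when the instance completes."""

    def __init__(self, key: int):
        self.key = key
        self.done = threading.Event()
        self.variables: Optional[dict] = None

    def complete(self, variables: dict) -> None:
        self.variables = variables
        self.done.set()


def respond(instance: PendingInstance, prefer_header: Optional[str]) -> Tuple[int, dict]:
    """Return (status, body): 200 if completed within the wait, else 202."""
    wait = parse_wait_seconds(prefer_header)
    if wait is not None and instance.done.wait(timeout=wait):
        return 200, {"processInstanceKey": instance.key, "variables": instance.variables}
    return 202, {"processInstanceKey": instance.key}


finished = PendingInstance(key=1)
finished.complete({"approved": True})
sync_response = respond(finished, "wait=30")

still_running = PendingInstance(key=2)
async_response = respond(still_running, None)
```

Blocking a thread per waiting request is the simplest option. An async server would await the same signal without holding a thread, which matters if many clients use long waits.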
14. "Worker takes longer than the timeout"
This is a real race condition:
```mermaid
sequenceDiagram
    participant E as Engine
    participant A as Worker A
    participant B as Worker B
    Note over E,A: T=0: Job assigned to Worker A, timeout=30s
    A->>A: T=10: starts processing (slow API call)
    Note over E,A: T=30: Timeout expires
    E->>B: T=30: Re-assigns to Worker B
    B-->>E: T=35: Completes (success)
    A-->>E: T=45: Finally completes (DUPLICATE submission)
```
Mitigations:
1. Worker idempotency (ADR-007): fundamental
2. Job versioning: each activation increments a version, the completion includes that version, and the engine rejects stale ones
3. Optimistic completion: the engine accepts the first completion and ignores duplicates
4. Worker self-check: before submitting a result, the worker asks "am I still the owner?"
Schema:
```sql
ALTER TABLE jobs ADD COLUMN activation_count BIGINT DEFAULT 0;

-- On activation
UPDATE jobs SET activation_count = activation_count + 1, ...

-- On completion
UPDATE jobs SET state = 'COMPLETED'
WHERE job_key = $1 AND activation_count = $2; -- match version
```
A stale completion does not match the current activation_count, so the UPDATE affects zero rows and the completion is rejected: no duplicate processing.
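The activation_count check can be exercised end to end with an in-memory database. The sketch below uses SQLite from the Python standard library as a stand-in for Postgres, purely so the example is self-contained. The state names `ACTIVATABLE` and `ACTIVATED`, the extra state guard on completion, and the helper function names are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE jobs ("
    " job_key INTEGER PRIMARY KEY,"
    " state TEXT NOT NULL,"
    " activation_count INTEGER NOT NULL DEFAULT 0)"
)
conn.execute("INSERT INTO jobs (job_key, state) VALUES (1, 'ACTIVATABLE')")


def activate(conn: sqlite3.Connection, job_key: int) -> int:
    """Hand the job to a worker and return the version that worker must echo back."""
    conn.execute(
        "UPDATE jobs SET state = 'ACTIVATED', activation_count = activation_count + 1"
        " WHERE job_key = ?",
        (job_key,),
    )
    row = conn.execute(
        "SELECT activation_count FROM jobs WHERE job_key = ?", (job_key,)
    ).fetchone()
    return row[0]


def complete(conn: sqlite3.Connection, job_key: int, version: int) -> bool:
    """Complete the job only if the caller holds the current activation."""
    cur = conn.execute(
        "UPDATE jobs SET state = 'COMPLETED'"
        " WHERE job_key = ? AND activation_count = ? AND state = 'ACTIVATED'",
        (job_key, version),
    )
    return cur.rowcount == 1  # zero rows updated means a stale or duplicate completion


version_a = activate(conn, 1)      # T=0: Worker A gets the job
version_b = activate(conn, 1)      # T=30: timeout, job re-assigned to Worker B
b_accepted = complete(conn, 1, version_b)   # T=35: Worker B completes
a_accepted = complete(conn, 1, version_a)   # T=45: Worker A's late completion
```

Worker A's completion carries version 1 while the row is at version 2, so it updates no rows. The API layer can turn that into a conflict response so the worker knows its result was discarded.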
15. Invalid BPMN
Validation steps at deployment:
1. XSD validation against the BPMN 2.0 schema (well-formed XML + structure)
2. Semantic validation:
   - Process has a start event
   - Sequence flows reference valid elements
   - Exclusive gateways have conditions
   - Service tasks have a job type
3. MVP-specific validation:
   - Only supported element types
   - Only supported event types
   - FEEL expressions parse correctly
4. Reject with detailed errors:
```
400 Bad Request
{
  "errors": [
    { "element": "task_check", "message": "Service task missing zeebe:taskDefinition" },
    { "element": "gw_decision", "message": "Exclusive gateway has no default flow" }
  ]
}
```
Validation library: bpmn-js can also validate, so it may be reusable.
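Part of the semantic validation step can be sketched with the standard-library XML parser. This is a minimal sketch covering only three of the checks listed above (start event present, sequence flows reference known ids, service tasks carry a zeebe:taskDefinition). It inspects only direct children of each process, so elements nested in sub-processes are not checked. The function name and the sample model are illustrative.

```python
import xml.etree.ElementTree as ET

NS = {
    "bpmn": "http://www.omg.org/spec/BPMN/20100524/MODEL",
    "zeebe": "http://camunda.org/schema/zeebe/1.0",
}


def validate_bpmn(xml_text: str) -> list:
    """Return a list of {"element", "message"} errors; empty means valid."""
    errors = []
    root = ET.fromstring(xml_text)  # raises ParseError if the XML is malformed
    for process in root.findall("bpmn:process", NS):
        if process.find("bpmn:startEvent", NS) is None:
            errors.append({"element": process.get("id"),
                           "message": "Process has no start event"})
        known_ids = {el.get("id") for el in process.iter() if el.get("id")}
        for flow in process.findall("bpmn:sequenceFlow", NS):
            for attr in ("sourceRef", "targetRef"):
                if flow.get(attr) not in known_ids:
                    errors.append({"element": flow.get("id"),
                                   "message": f"{attr} references unknown element"})
        for task in process.findall("bpmn:serviceTask", NS):
            if task.find("bpmn:extensionElements/zeebe:taskDefinition", NS) is None:
                errors.append({"element": task.get("id"),
                               "message": "Service task missing zeebe:taskDefinition"})
    return errors


# Deliberately broken model: no start event, dangling flow, untyped service task.
SAMPLE = """
<bpmn:definitions xmlns:bpmn="http://www.omg.org/spec/BPMN/20100524/MODEL"
                  xmlns:zeebe="http://camunda.org/schema/zeebe/1.0">
  <bpmn:process id="p1">
    <bpmn:serviceTask id="task_check"/>
    <bpmn:sequenceFlow id="f1" sourceRef="task_check" targetRef="missing"/>
  </bpmn:process>
</bpmn:definitions>
"""

errors = validate_bpmn(SAMPLE)
```

Collecting every error before rejecting, rather than stopping at the first, matches the 400 response shape above and saves modelers a round trip per mistake.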
16. Quantitative KPIs for the MVP
To validate success:
| Metric | Phase 0 target | How to measure |
|---|---|---|
| Functional: BPMN coverage | Service task + User task + Gateway + Timer + Message | Property-based tests pass |
| Performance: TPS sustained creation | 100 | k6 load test 4h |
| Performance: Job TPS | 500 | k6 worker simulation |
| Performance: TP99 latency | < 1s | OTel metrics |
| Reliability: Uptime | 99% | Health check pings |
| Reliability: Data loss | 0 | Replay determinism test |
| Storage: Per-instance | < 50 KB avg | Postgres size queries |
| Compatibility: Camunda BPMN | 80% of test models work | Test corpus |
| Developer UX: Worker SDK | < 10 LOC for "hello world" worker | Code review |
| Ops UX: Setup time | < 30 min from git clone to running | Onboarding doc |
Conclusion of the self-review
Gaps that require a deep dive (6)
- ✅ Backpressure without gRPC → concepts/backpressure-rest-strategy
- ✅ Process definition cache → concepts/process-definition-cache
- ✅ Timer recovery → concepts/timer-recovery-postgres
- ✅ FEEL expressions → analysis/feel-expressions-strategy
- ✅ Log compaction → concepts/command-log-compaction
- ✅ API serialization → concepts/api-engine-serialization
Gaps answered briefly (on this page)
7-16: documented above with a summary of the strategy.
Possible future iterations
- Detailed multi-region deployment patterns (Phase 5 spec)
- Disaster recovery (DR) runbook
- Security threat model (STRIDE analysis)
- Performance tuning guide (Postgres + engine)
- Specific migration guide from Camunda 8
These are beyond the MVP scope and are deferred until needed.
Self-critique of the existing ADRs
Re-reading the 21 ADRs, I identify these potential issues:
- ADR-002 (PostgreSQL only): Have I validated that 200 TPS is achievable? A concrete benchmark of the proposed schema is missing.
- ADR-006 (Single-threaded): How does throughput scale within a single thread? Profiling is needed.
- ADR-010 (Hybrid monitoring): the cost projection assumes self-hosted APM. With a SaaS APM, costs scale differently.
- ADR-018 (Build Tasklist): the scope is still ~8K LOC. Is there a way to reduce it further?
These are refinements, not fundamental flaws. Document them in future revisions of each ADR if necessary.
Conclusion
The wiki had real gaps that an implementer would hit on day 1. This self-review identified 6 critical and 10 secondary ones. The 6 critical gaps require dedicated pages, which will follow in upcoming commits.
This is an example of the benefit of the Karpathy methodology: continuously reviewing and deepening the material improves the wiki's quality for real use.