Concrete Implementation Roadmap

A concrete 26-week plan to build the MVP (Phase 0), plus 6 additional weeks for Phase 1 (HA). Each week has clear deliverables, validation criteria, and risk indicators. Team sizing: 1 backend engineer + 1 frontend engineer + 0.5 platform engineer = 2.5 FTE of core engineering, plus 0.5 FTE of part-time tech lead and QA/SRE, for 3.0 FTE in total. Timeline: ~6 months (26 weeks) to a functional MVP, ~7.5 months (32 weeks) to a production-ready Phase 1. Milestones: M1 (week 4) skeleton + first BPMN executes, M2 (week 12) full engine + Tasklist, M3 (week 20) production hardening, M4 (week 26) MVP ship, M5 (week 32) Phase 1 HA.

Assumed team composition

| Role | FTE | Responsibilities |
|---|---|---|
| Backend engineer (lead) | 1.0 | Engine core, API, processors |
| Frontend engineer | 1.0 | Tasklist, Process Inspector, CLI |
| Platform engineer | 0.5 | Infra, CI/CD, observability, deploy |
| Tech lead / architect | 0.25 | Reviews, decisions, planning |
| QA / SRE | 0.25 | Testing, runbooks, on-call |
| Total | 3.0 | |

A smaller team (1 backend FTE) is possible, but it would extend the timeline by ~50%.

Milestones overview

| Milestone | Week | Deliverable | Validation |
|---|---|---|---|
| M1 | 4 | Skeleton runs, simple BPMN executes | Hello-world process: start → service task → end |
| M2 | 12 | Full engine + Tasklist functional | 80% of the test BPMN suite passes |
| M3 | 20 | Production hardening complete | Load test: 100 TPS sustained for 4h |
| M4 | 26 | MVP SHIPPED (Phase 0) | First customer / internal use |
| M5 | 32 | Phase 1 HA running | 99.9% SLA validated |

Detailed roadmap

Weeks 1-4: Foundation (M1 milestone)

Goal: Functional skeleton. The first BPMN process executes end-to-end.

Week 1: Setup + DB schema

Backend:

- [ ] Repo setup (Go + Bun/Vite frontend, structure per self review critical gaps)
- [ ] Basic CI/CD (lint, test, build)
- [ ] Local Postgres setup (Docker Compose)
- [ ] Initial migrations:
    - tenants, users, user_tenants, api_keys
    - process_definitions, deployments
    - command_log, event_log (with pg_partman)
    - process_instances, element_instances, variables
    - jobs, user_tasks, incidents, timers

Platform:

- [ ] Infrastructure-as-code (Terraform/Pulumi)
- [ ] Dev/staging environments
- [ ] Observability setup (OTel collector → local Grafana)

Validation: pg_dump succeeds, migrations apply cleanly, and the engine can run a basic SELECT 1 against the DB.

Week 2: API skeleton + Auth

Backend:

- [ ] REST framework setup (Gin/Fiber/Chi for Go)
- [ ] OpenAPI 3.0 spec started
- [ ] Auth middleware:
    - OIDC JWT validation
    - API key validation
    - Tenant resolution from token
- [ ] Endpoint stubs:
    - POST /v2/deployments
    - POST /v2/process-instances
    - GET /v2/process-instances/{key}
- [ ] Per ADR-024: Postgres RLS policies activated
- [ ] Per ADR-025: audit_log table + write helper

Frontend:

- [ ] React/Vue app skeleton
- [ ] Auth flow with Auth0 or a local Keycloak
- [ ] Layout (sidebar, header, routing)

Validation: curl -H "Authorization: Bearer <jwt>" /v2/process-instances returns 200 (empty list).

Week 3: Command log + simple processing

Backend:

- [ ] Command log INSERT in the API
- [ ] Engine processing loop (single thread, async)
- [ ] LISTEN/NOTIFY signal for new commands
- [ ] Simple BPMN parser (start/end/service task only)
- [ ] First processor: ProcessInstanceCreationCreateProcessor
- [ ] Event log writes (atomic with state)

Tests:

- [ ] Replay determinism test scaffolding (ADR-019)
- [ ] First property-based test (generate random simple BPMN, verify completion)

Validation:

mvp-cli deploy hello.bpmn
mvp-cli create-instance hello
# → Engine creates instance, persists to DB
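The replay determinism property that the Week 3 test scaffolding targets can be illustrated with a toy event-sourced fold. This is a minimal sketch, not the engine's code: the `Event` shape, the event kinds, and the `apply_event` reducer are all hypothetical. The idea it shows is that state is a pure function of the event log, so replaying the same log must always produce the same state.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """One immutable event_log entry (hypothetical shape)."""
    position: int
    kind: str          # e.g. "INSTANCE_CREATED", "INSTANCE_COMPLETED"
    instance_key: int
    timestamp_ms: int  # recorded at write time, never re-read from a clock


def apply_event(state: dict, event: Event) -> dict:
    """Pure reducer: builds a new state using only data carried by the event."""
    new_state = dict(state)
    if event.kind == "INSTANCE_CREATED":
        new_state[event.instance_key] = {"state": "ACTIVE", "updated_ms": event.timestamp_ms}
    elif event.kind == "INSTANCE_COMPLETED":
        new_state[event.instance_key] = {"state": "COMPLETED", "updated_ms": event.timestamp_ms}
    return new_state


def replay(events: list) -> dict:
    """Rebuild state from scratch by folding events in log order."""
    state: dict = {}
    for event in sorted(events, key=lambda e: e.position):
        state = apply_event(state, event)
    return state


log = [
    Event(1, "INSTANCE_CREATED", 100, 1_000),
    Event(2, "INSTANCE_COMPLETED", 100, 2_500),
]
```

A determinism test then reduces to asserting that `replay(log) == replay(log)`. Because timestamps travel inside the events, the reducer never needs to consult the wall clock.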

Week 4: Service task + workers (M1)

Backend:

- [ ] Service task processor
- [ ] Job creation on ACTIVATE_ELEMENT
- [ ] Job activation endpoint with SKIP LOCKED (per job queue fairness)
- [ ] Job complete/fail endpoints
- [ ] Element completion / next sequence flow

SDK:

- [ ] Worker SDK skeleton (Go + TypeScript)
- [ ] Hello-world example worker

Validation: end-to-end test:

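# Hello-world process (pseudocode: deploy, create_instance, activate_jobs,
# complete_job and get_instance stand for thin client wrappers over the API)
deploy('hello.bpmn')  # start → service task "say-hello" → end

pi = create_instance('hello')

# Worker: activate one job of type "say-hello" and complete it
job = activate_jobs('say-hello', maxJobs=1)[0]
complete_job(job.key, {"message": "Hello, world!"})

# Verify the instance ran to completion
assert get_instance(pi.key).state == 'COMPLETED'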

🎯 M1 reached: Skeleton works end-to-end.

Weeks 5-12: Full engine + Tasklist (M2 milestone)

Goal: Most BPMN features work. Tasklist functional.

Weeks 5-6: Gateways + Sequence Flows

Backend:

- [ ] Exclusive gateway processor (with CEL expressions per ADR-022)
- [ ] Parallel gateway processor (fork + join)
- [ ] Default sequence flow
- [ ] Condition evaluation (CEL integration)
- [ ] Variable scoping (per variable scoping)

Tests:

- [ ] Property-based: random gateways + paths

Validation: a BPMN process whose exclusive gateway branches on a condition such as amount > 100 takes the correct path.
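The exclusive gateway behavior listed above (first matching condition wins, otherwise the default flow) can be sketched as follows. This is an illustration under assumptions, not the planned processor: `SequenceFlow` and `choose_flow` are hypothetical names, and plain Python predicates stand in for compiled CEL expressions.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class SequenceFlow:
    target: str
    # Stand-in for a compiled CEL expression: a predicate over process variables.
    condition: Optional[Callable[[dict], bool]] = None


def choose_flow(flows: list, default: Optional[SequenceFlow], variables: dict) -> SequenceFlow:
    """Return the first flow whose condition holds, else the default flow."""
    for flow in flows:
        if flow.condition is not None and flow.condition(variables):
            return flow
    if default is None:
        # A real engine would more likely raise an incident than an exception.
        raise LookupError("no condition matched and no default flow is defined")
    return default


flows = [SequenceFlow("manual-approval", lambda v: v["amount"] > 100)]
default_flow = SequenceFlow("auto-approve")
```

With these definitions, `choose_flow(flows, default_flow, {"amount": 250})` selects the `manual-approval` flow, while an amount of exactly 100 falls through to `auto-approve`.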

Weeks 7-8: User tasks + Tasklist UI

Backend:

- [ ] User task processor (CREATE, ASSIGN, COMPLETE)
- [ ] User task search endpoint
- [ ] Atomic claim via UPDATE ... RETURNING
- [ ] Forms storage + retrieval

Frontend (Tasklist):

- [ ] "My tasks" list
- [ ] "Available" list
- [ ] Task detail page
- [ ] Claim/unclaim actions
- [ ] JSON Schema form rendering (rjsf)
- [ ] Complete with variables

Validation: the human approval flow works end-to-end.
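The atomic claim item above relies on a guarded UPDATE: the row is modified only if it is still unassigned, so two concurrent claimers cannot both win. The sketch below demonstrates the pattern with Python's built-in sqlite3 as a stand-in for Postgres, and it checks `rowcount` instead of using RETURNING. The table layout and the `claim` helper are assumptions made for illustration.

```python
import sqlite3

# In-memory database as a stand-in for the Postgres user_tasks table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE user_tasks (task_key INTEGER PRIMARY KEY, assignee TEXT)")
conn.execute("INSERT INTO user_tasks (task_key, assignee) VALUES (1, NULL)")
conn.commit()


def claim(conn: sqlite3.Connection, task_key: int, user: str) -> bool:
    """Claim a task only if it is still unassigned; True means this caller won."""
    cur = conn.execute(
        "UPDATE user_tasks SET assignee = ? WHERE task_key = ? AND assignee IS NULL",
        (user, task_key),
    )
    conn.commit()
    # rowcount is 1 when the guarded UPDATE matched, 0 when the task was already taken.
    return cur.rowcount == 1
```

The first call to `claim(conn, 1, "alice")` succeeds and a later `claim(conn, 1, "bob")` returns False. In Postgres the same statement with a RETURNING clause would additionally hand back the claimed row in a single round trip.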

Weeks 9-10: Timers + Messages

Backend:

- [ ] Timer processor (per timer recovery postgres)
- [ ] DueDate scheduler (in-memory min-heap)
- [ ] Recovery scan on startup
- [ ] Message correlation (per message correlation)
- [ ] Boundary timer events
- [ ] Intermediate timer events
- [ ] Message start events
- [ ] Message intermediate catch events

Validation: A process with "wait 1 hour then continue" works. A message sent via the API correlates to the waiting process.
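The in-memory min-heap scheduler and the startup recovery scan from the list above could look roughly like this. It is a sketch under assumptions: `Timer`, `DueDateScheduler`, and their method names are hypothetical, and the persisted timers are passed in as a plain list where the real engine would read them from the timers table.

```python
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class Timer:
    due_ms: int                       # heap ordering uses the due date only
    key: int = field(compare=False)   # timer identity, ignored when ordering


class DueDateScheduler:
    def __init__(self) -> None:
        self._heap: list = []

    def schedule(self, timer: Timer) -> None:
        heapq.heappush(self._heap, timer)

    def next_due_ms(self):
        """Earliest due date, or None when nothing is scheduled."""
        return self._heap[0].due_ms if self._heap else None

    def pop_due(self, now_ms: int) -> list:
        """Remove and return every timer due at or before now_ms, earliest first."""
        due = []
        while self._heap and self._heap[0].due_ms <= now_ms:
            due.append(heapq.heappop(self._heap))
        return due

    def recover(self, persisted: list) -> None:
        """Rebuild the heap from durable timers, e.g. after an engine restart."""
        self._heap = list(persisted)
        heapq.heapify(self._heap)
```

The heap is only a cache of the durable timers, which is why `recover` can rebuild it wholesale on startup. The caller supplies `now_ms`, so tests can advance time without sleeping.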

Weeks 11-12: Incidents + Inspector (M2)

Backend:

- [ ] Incident creation on job_no_retries
- [ ] Incident resolution endpoint
- [ ] Incident on CEL evaluation failure
- [ ] Process Inspector API (search, filters)

Frontend (Process Inspector):

- [ ] Search by key/business ID
- [ ] List with filters
- [ ] Detail modal
- [ ] Variables modal
- [ ] Resolve incident
- [ ] Cancel instance

🎯 M2 reached: 80% of the test BPMN suite passes. Tasklist + Inspector functional.

Weeks 13-20: Production hardening (M3 milestone)

Goal: Production-ready. Comprehensive testing. Observability.

Week 13: Backpressure + rate limiting

Backend:

- [ ] Per-tenant rate limiter (token bucket per backpressure rest strategy)
- [ ] Engine queue depth monitoring
- [ ] HTTP 429/503 responses
- [ ] Whitelisted critical commands

Validation: The load test shows rate limiting kicking in at the configured threshold.
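A per-tenant token bucket, as named in the Week 13 list, can be sketched in a few lines. The class names, the parameters (`rate_per_s`, `burst`), and the injectable `clock` are assumptions for illustration, and the referenced backpressure strategy may define different knobs.

```python
import time
from typing import Callable


class TokenBucket:
    def __init__(self, rate_per_s: float, burst: int,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate_per_s = rate_per_s
        self.burst = burst
        self._clock = clock
        self._tokens = float(burst)   # start full so short bursts are allowed
        self._last = clock()

    def allow(self, cost: float = 1.0) -> bool:
        """Refill based on elapsed time, then spend `cost` tokens if available."""
        now = self._clock()
        self._tokens = min(self.burst, self._tokens + (now - self._last) * self.rate_per_s)
        self._last = now
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False  # the caller would answer HTTP 429 here


class TenantRateLimiter:
    """One independent bucket per tenant, created lazily on first request."""

    def __init__(self, rate_per_s: float, burst: int,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._rate, self._burst, self._clock = rate_per_s, burst, clock
        self._buckets: dict = {}

    def allow(self, tenant_id: str) -> bool:
        bucket = self._buckets.get(tenant_id)
        if bucket is None:
            bucket = TokenBucket(self._rate, self._burst, self._clock)
            self._buckets[tenant_id] = bucket
        return bucket.allow()
```

Injecting the clock keeps the limiter testable without sleeping. A whitelist of critical commands could be implemented by skipping the `allow` check for those command types.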

Week 14: Process definition cache

Backend:

- [ ] LRU cache for ExecutableProcess (per process definition cache)
- [ ] Invalidation via LISTEN/NOTIFY
- [ ] Latest version cache (TTL)
- [ ] Cache metrics

Validation: Cache hit rate > 95% in the load test.
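The LRU cache with invalidation and hit-rate metrics from the Week 14 list can be approximated with `collections.OrderedDict`. This is a sketch: `ProcessDefinitionCache` and the `loader` callback are hypothetical names, and in the real engine `invalidate` would presumably be triggered by a LISTEN/NOTIFY message.

```python
from collections import OrderedDict
from typing import Any, Callable


class ProcessDefinitionCache:
    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self._entries: OrderedDict = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, key: Any, loader: Callable[[Any], Any]) -> Any:
        """Return the cached value, loading (and possibly evicting) on a miss."""
        if key in self._entries:
            self.hits += 1
            self._entries.move_to_end(key)      # mark as most recently used
            return self._entries[key]
        self.misses += 1
        value = loader(key)                     # e.g. parse BPMN loaded from the DB
        self._entries[key] = value
        if len(self._entries) > self.capacity:
            self._entries.popitem(last=False)   # evict the least recently used entry
        return value

    def invalidate(self, key: Any) -> None:
        """Drop one entry, e.g. when a new deployment is announced."""
        self._entries.pop(key, None)

    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0
```

The `hit_rate` value is the number the "> 95%" validation criterion would be checked against during the load test.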

Week 15: Idempotency + worker SDK improvements

Backend:

- [ ] Idempotency keys (per api engine serialization)
- [ ] Activation count tracking

SDK:

- [ ] Worker SDK: idempotency helpers
- [ ] Polyglot: Python SDK
- [ ] Retry/backoff built-in
- [ ] Health checks
- [ ] OTel integration in SDK

Validation: Submitting the same request twice does not create a duplicate.
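The validation above can be made concrete with a toy idempotency store. This is an in-memory sketch with hypothetical names (`IdempotentCommandStore`, `create_instance`); in Postgres one common way to get the same guarantee is a unique constraint on the tenant and idempotency key, but the referenced design may differ.

```python
import itertools


class IdempotentCommandStore:
    """Remembers the result of each (tenant, idempotency key) pair."""

    def __init__(self) -> None:
        self._results: dict = {}
        self._next_key = itertools.count(1)

    def create_instance(self, tenant_id: str, idempotency_key: str, process_id: str) -> dict:
        scope = (tenant_id, idempotency_key)   # keys are scoped per tenant
        if scope in self._results:
            # Replay: return the original result instead of creating a duplicate.
            return self._results[scope]
        result = {"instance_key": next(self._next_key), "process_id": process_id}
        self._results[scope] = result
        return result
```

Calling `create_instance("t1", "req-1", "hello")` twice returns the same instance key, while a different key or a different tenant creates a new instance.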

Week 16: Observability complete

Platform:

- [ ] All business metrics emitted (per adr 011 opentelemetry instrumentation)
- [ ] Distributed tracing setup
- [ ] Log structuring (JSON, no sensitive data per security model)
- [ ] Grafana dashboards (5 main dashboards, per adr 009 skip optimize use grafana)
- [ ] Alert rules (Prometheus Alertmanager)

Validation: The dashboard shows real-time TPS, latency, and queue depth.

Week 17: Error handling + Edge cases

Backend:

- [ ] Error boundary events
- [ ] Throw error (worker → BPMN)
- [ ] Retry/backoff logic in JobFailProcessor
- [ ] Per-instance event count limit (anti-infinite-loop)

Validation: A process with an error handler catches the error and routes correctly.
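The retry/backoff logic planned for JobFailProcessor, together with the incident raised when retries run out, might look like the sketch below. The exponential schedule, its defaults (1 s base, factor 2, 60 s cap), and the job dictionary fields are assumptions chosen for illustration.

```python
def backoff_ms(attempt: int, base_ms: int = 1_000, factor: float = 2.0,
               cap_ms: int = 60_000) -> int:
    """Delay before retry number `attempt` (1-based), capped at cap_ms."""
    return int(min(cap_ms, base_ms * factor ** (attempt - 1)))


def fail_job(job: dict, now_ms: int) -> dict:
    """Return the job's next state after a worker reports a failure."""
    retries_left = job["retries"] - 1
    attempt = job.get("attempt", 1)
    if retries_left <= 0:
        # No retries left: the job stops being activatable and an incident is raised.
        return {**job, "retries": 0, "state": "INCIDENT"}
    return {
        **job,
        "retries": retries_left,
        "attempt": attempt + 1,
        "state": "RETRY_SCHEDULED",
        "retry_at_ms": now_ms + backoff_ms(attempt),
    }
```

A job that starts with 3 retries is rescheduled after 1 s, then after 2 s, and becomes an incident on the third failure.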

Week 18: CLI tool + admin operations

Tool:

- [ ] mvp-cli command set complete:
    - process-instances list/get/cancel/variables
    - incidents list/resolve
    - jobs list/cancel
    - tenants create/suspend/resume
    - api-keys create/list/rotate/delete
    - deploy
    - batch operations

Validation: All ops user stories can be carried out via the CLI.

Week 19: Storage management

Backend:

- [ ] pg_partman setup for command_log + event_log
- [ ] Retention policies per tenant
- [ ] Archival to S3 (cron job)
- [ ] Cleanup of old state

Validation: 30-day retention test (generate 30 synthetic days of data, verify archival).
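The selection step of a retention job reduces to a date comparison. The sketch below assumes daily partitions (the roadmap does not state the partition interval) and uses a hypothetical helper, `partitions_to_archive`; the actual archival to S3 and the pg_partman configuration are out of scope here.

```python
from datetime import date, timedelta


def partitions_to_archive(partition_days: list, today: date, retention_days: int) -> list:
    """Return the partition dates that fall strictly before the retention cutoff."""
    cutoff = today - timedelta(days=retention_days)
    return sorted(d for d in partition_days if d < cutoff)


# Synthetic data: one partition per day for the last 35 days, including today.
today = date(2024, 3, 31)
partition_days = [today - timedelta(days=n) for n in range(35)]
expired = partitions_to_archive(partition_days, today, retention_days=30)
```

With 35 daily partitions and a 30-day retention, `expired` contains the four oldest partitions, which is the kind of assertion the synthetic retention test could make.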

Week 20: Comprehensive testing (M3)

QA:

- [ ] Load test: 100 TPS sustained for 4 hours
- [ ] Failover test (kill engine, verify recovery)
- [ ] Chaos test (network partitions, DB hiccups)
- [ ] Security test (per security threat model threats)
- [ ] Multi-tenant isolation test
- [ ] Replay determinism property tests (1000 iterations)

🎯 M3 reached: Production-ready Phase 0.

Weeks 21-26: Polish + ship (M4 milestone)

Goal: Ship to first user.

Week 21: Documentation

Doc team (engineers writing):

- [ ] Getting started guide
- [ ] BPMN supported subset reference
- [ ] API documentation (auto-generated from OpenAPI)
- [ ] Worker SDK guides (TS, Python, Go)
- [ ] Operations runbook (per failure mode analysis)
- [ ] Migration guide from Camunda 8

Week 22: BPMN compatibility testing

QA:

- [ ] Test corpus: 50+ real BPMN files
- [ ] Conversion tool: Camunda FEEL → CEL
- [ ] Document supported/unsupported elements
- [ ] Migration tool: import Camunda 8 process definitions

Validation: 80% of the test corpus deploys + executes correctly.

Week 23: First user deployment

Platform:

- [ ] Production environment provisioned
- [ ] Production deployment automation
- [ ] Monitoring + alerts in production
- [ ] Backup strategy verified (pgBackRest → S3)
- [ ] DR drill (restore from backup, validate)

User onboarding:

- [ ] First user deploys their workflow
- [ ] Provide support
- [ ] Iterate on UX feedback

Week 24: Performance tuning

Backend (based on production observations):

- [ ] Slow queries identified → add indexes
- [ ] Cache hit rates → tune sizes
- [ ] Connection pool → adjust if needed
- [ ] Job activation latency → optimize query

Validation: TP99 < 1s sustained in production.

Week 25: Bug fixes + UX polish

All team:

- [ ] Bug bash week
- [ ] Polish UI (Tasklist + Inspector)
- [ ] Improve error messages
- [ ] Documentation gaps filled

Week 26: Ship (M4)

Launch:

- [ ] Public announcement / launch
- [ ] On-call rotation established
- [ ] Customer support channels ready
- [ ] Retrospective

🎯 M4 reached: MVP SHIPPED.

Weeks 27-32: Phase 1 HA (M5 milestone)

Goal: 99.9% SLA via HA.

Weeks 27-28: Patroni setup

Platform:

- [ ] Patroni + etcd cluster (per adr 020 patroni postgres ha)
- [ ] 3-node Postgres setup
- [ ] PgBouncer in front
- [ ] Failover testing

Week 29: Application changes for HA

Backend:

- [ ] Retry logic in DB client (handles failover)
- [ ] Connection string rotation
- [ ] Read-replica reads for non-critical queries (Tasklist, Inspector)
- [ ] Lag monitoring

Week 30: HA testing

QA:

- [ ] Kill primary, verify failover < 30s
- [ ] Replication lag tests
- [ ] Split-brain prevention tests
- [ ] Test PITR recovery

Week 31: Monitoring + runbooks

Platform:

- [ ] Patroni dashboards
- [ ] Failover runbook
- [ ] DR plan documented + drilled
- [ ] On-call procedures

Week 32: Production HA rollout (M5)

Migration:

- [ ] Migrate production to Patroni cluster (zero-downtime)
- [ ] Validate SLA over 2 weeks
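To make the 2-week SLA validation concrete, it helps to translate 99.9% into an allowed-downtime budget. The helper below is a plain arithmetic sketch; the function name is hypothetical and the calculation assumes the SLA is measured as uptime over the whole window.

```python
def downtime_budget_minutes(sla_percent: float, window_days: float) -> float:
    """Maximum downtime, in minutes, that still meets the SLA over the window."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (1 - sla_percent / 100)


two_week_budget = downtime_budget_minutes(99.9, 14)   # about 20.16 minutes
monthly_budget = downtime_budget_minutes(99.9, 30)    # about 43.2 minutes
```

Over the 2-week validation window the budget is roughly 20 minutes, so a failover that completes in under 30 seconds consumes only a small fraction of it.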

🎯 M5 reached: Phase 1 HA operational.

Critical path risks

Risk 1: BPMN parser complexity underestimated

Probability: High (the BPMN spec is complex)
Impact: Slip by 2-4 weeks
Mitigation:

- Strict scope (only supported elements per ADR-001 subset)
- Reuse libraries (bpmn-go if Go, bpmn-server reference)
- Time-box parser work (max 3 weeks total)

Risk 2: Replay determinism bugs slow development

Probability: Medium
Impact: Slip by 2 weeks
Mitigation:

- Property tests from week 3 (early detection)
- Strict coding rules (no direct time.Now() calls in processors)
- Code reviews enforce determinism

Risk 3: Tasklist UX rework needed

Probability: Medium
Impact: Slip by 1-2 weeks
Mitigation:

- User testing in week 8 (catch issues early)
- Reuse design system / component library
- Defer advanced features to Phase 2

Risk 4: Performance not meeting target

Probability: Medium
Impact: 1-2 weeks of tuning
Mitigation:

- Profile early (week 16)
- Cache decisions early (per ADR analysis)
- Engineer mindfully (per microbenchmark methodology)

Risk 5: Security gaps discovered late

Probability: Medium
Impact: 1-2 weeks of remediation
Mitigation:

- Threat model review in week 12 (M2)
- External pentest before M4 ship
- Security checklist enforced

Validation criteria per milestone

M1 (Week 4): Skeleton

- Engine runs locally via make dev
- One BPMN process deploys + executes
- Service task → worker → completion works
- State persists to Postgres
- Tests passing in CI

M2 (Week 12): Functional

- BPMN coverage: start/end events, service/user tasks, exclusive/parallel gateways, timers, messages
- Tasklist UI functional
- Process Inspector UI functional
- Incidents work
- 80% of the test BPMN corpus passes

M3 (Week 20): Production-ready

- 100 TPS sustained for 4h without degradation
- Failover (engine restart) < 30s
- TP99 < 1s
- All security threats mitigated (per threat model)
- Replay determinism tests passing
- Comprehensive observability
- DR plan documented

M4 (Week 26): Shipped

- First real user running in production
- No P0 bugs open
- Documentation complete
- Support channels established

M5 (Week 32): HA

- Postgres HA via Patroni
- 99.9% SLA achieved in monitoring
- DR drill successful

Effort estimates

| Phase | Engineer-weeks | Calendar weeks |
|---|---|---|
| Foundation (M1) | 8 | 4 |
| Core engine (M2) | 24 | 8 |
| Hardening (M3) | 24 | 8 |
| Ship (M4) | 18 | 6 |
| HA (M5) | 18 | 6 |
| Total to M4 | 74 | 26 |
| Total to M5 | 92 | 32 |

74 engineer-weeks / 3 engineers ≈ 25 calendar weeks (~6 months), which fits the 26-week M4 timeline with about one week of buffer.
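The totals in the effort table and the 25-week figure can be re-derived mechanically, which is handy when an estimate changes. The snippet below only restates the table; the `PHASES` mapping and helper names are illustrative.

```python
import math

# (engineer-weeks, calendar weeks) per phase, copied from the effort table.
PHASES = {
    "Foundation (M1)": (8, 4),
    "Core engine (M2)": (24, 8),
    "Hardening (M3)": (24, 8),
    "Ship (M4)": (18, 6),
    "HA (M5)": (18, 6),
}


def totals(phase_names: list) -> tuple:
    """Sum engineer-weeks and calendar weeks over the given phases."""
    engineer_weeks = sum(PHASES[name][0] for name in phase_names)
    calendar_weeks = sum(PHASES[name][1] for name in phase_names)
    return engineer_weeks, calendar_weeks


to_m4 = totals(list(PHASES)[:4])   # phases up to and including Ship (M4)
to_m5 = totals(list(PHASES))       # all phases
weeks_with_three_engineers = math.ceil(to_m4[0] / 3)

# Engineers implied by each phase: engineer-weeks divided by calendar weeks.
implied_headcount = {name: eng / cal for name, (eng, cal) in PHASES.items()}
```

The implied headcount is 3.0 for every phase except Foundation, where 8 engineer-weeks over 4 calendar weeks corresponds to 2.0 engineers.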

Beyond M5: roadmap

Phase 1 (M5) is just the beginning. Future phases:

- Phase 2 (months 9-12): Active-active engines, strict multi-tenancy, SDK polish, more BPMN elements
- Phase 3 (months 13-18): Tenant sharding strategy, SaaS deployment, billing
- Phase 4 (months 19+): Citus migration, multi-region, advanced features

Links

- blueprint plataforma simplificada: Detailed spec
- scaling strategy postgres: Phase 0 → Phase 6 roadmap
- index: Architectural decisions
- security threat model: Threats mitigated by week 20
- failure mode analysis: Failures mitigated by weeks 18-20
- self review critical gaps: Gaps closed in weeks 13-15