Concrete Implementation Roadmap
Concrete 26-week plan to build the MVP (Phase 0), plus 6 additional weeks for Phase 1 (HA). Each week has clear deliverables, validation criteria, and risk indicators. Team sizing: 1 backend engineer + 1 frontend engineer + 0.5 platform engineer = 2.5 FTE of core engineering, or 3.0 FTE once the part-time tech lead and QA/SRE roles are counted. Total: ~6 months (26 weeks) to a functional MVP, ~7.5 months (32 weeks) to a production-ready Phase 1. Milestones: M1 (week 4) skeleton + first BPMN executes, M2 (week 12) full engine + Tasklist, M3 (week 20) production hardening, M4 (week 26) MVP ship, M5 (week 32) Phase 1 HA.
Assumed team composition
| Role | FTE | Responsibilities |
|---|---|---|
| Backend engineer (lead) | 1.0 | Engine core, API, processors |
| Frontend engineer | 1.0 | Tasklist, Process Inspector, CLI |
| Platform engineer | 0.5 | Infra, CI/CD, observability, deploy |
| Tech lead / architect | 0.25 | Reviews, decisions, planning |
| QA / SRE | 0.25 | Testing, runbooks, on-call |
| Total | 3.0 | |
A smaller team (1 backend FTE) is possible, but it would extend the timeline by ~50%.
Milestones overview
| Milestone | Week | Deliverable | Validation |
|---|---|---|---|
| M1 | 4 | Skeleton runs, simple BPMN executes | Hello-world process: start → service task → end |
| M2 | 12 | Full engine + Tasklist functional | 80% test BPMN suite passes |
| M3 | 20 | Production hardening complete | Load test: 100 TPS sustained 4h |
| M4 | 26 | MVP SHIPPED (Phase 0) | First customer / internal use |
| M5 | 32 | Phase 1 HA running | 99.9% SLA validated |
Detailed roadmap
Weeks 1-4: Foundation (M1 milestone)
Goal: Functional skeleton. The first BPMN process executes end-to-end.
Week 1: Setup + DB schema
Backend:
- [ ] Repo setup (Go + Bun/Vite frontend, structure per analysis/self-review-critical-gaps)
- [ ] Basic CI/CD (lint, test, build)
- [ ] Local Postgres setup (Docker Compose)
- [ ] Initial migrations:
    - tenants, users, user_tenants, api_keys
    - process_definitions, deployments
    - command_log, event_log (with pg_partman)
    - process_instances, element_instances, variables
    - jobs, user_tasks, incidents, timers
Platform:
- [ ] Infrastructure-as-code (Terraform/Pulumi)
- [ ] Dev/staging environments
- [ ] Observability setup (OTel collector → local Grafana)
Validation: pg_dump succeeds, migrations apply cleanly, basic SELECT 1 from engine to DB.
Week 2: API skeleton + Auth
Backend:
- [ ] REST framework setup (Gin/Fiber/Chi for Go)
- [ ] OpenAPI 3.0 spec started
- [ ] Auth middleware:
    - OIDC JWT validation
    - API key validation
    - Tenant resolution from token
- [ ] Endpoints stubs:
    - POST /v2/deployments
    - POST /v2/process-instances
    - GET /v2/process-instances/{key}
- [ ] Per-ADR-024: Postgres RLS policies activated
- [ ] Per-ADR-025: audit_log table + write helper
Frontend:
- [ ] React/Vue app skeleton
- [ ] Auth flow with Auth0/Keycloak local
- [ ] Layout (sidebar, header, routing)
Validation: curl -H "Authorization: Bearer <jwt>" /v2/process-instances returns 200 (empty list).
Week 3: Command log + simple processing
Backend:
- [ ] Command log INSERT in the API
- [ ] Engine processing loop (single thread, async)
- [ ] LISTEN/NOTIFY signal for new commands
- [ ] Simple BPMN parser (start/end/service task only)
- [ ] First processor: ProcessInstanceCreationCreateProcessor
- [ ] Event log writes (atomic with state)
Tests:
- [ ] Replay determinism test scaffolding (ADR-019)
- [ ] First property-based test (generate random simple BPMN, verify completion)
Validation:
Week 4: Service task + workers (M1)
Backend:
- [ ] Service task processor
- [ ] Job creation on ACTIVATE_ELEMENT
- [ ] Jobs activation endpoint with SKIP LOCKED (per concepts/job-queue-fairness)
- [ ] Job complete/fail endpoints
- [ ] Element completion / next sequence flow
SDK:
- [ ] Worker SDK skeleton (Go + TypeScript)
- [ ] Hello-world example worker
Validation: end-to-end test:
```python
# Hello world process
deploy('process.bpmn')  # start → service task "say-hello" → end
pi = create_instance('hello')

# Worker
job = activate_jobs('say-hello', maxJobs=1)[0]
complete_job(job.key, {"message": "Hello, world!"})

# Verify
assert get_instance(pi.key).state == 'COMPLETED'
```
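The helper calls in the test above (`deploy`, `create_instance`, `activate_jobs`, `complete_job`, `get_instance`) are not defined in this roadmap. As a rough illustration only, the sketch below is a hypothetical in-memory stand-in (`FakeEngine`, `Job`, and `Instance` are invented names, and `deploy` takes a list of service-task types instead of a BPMN file). It is meant to show the state transitions the M1 test exercises, not the real engine or SDK.

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass
class Job:
    key: int
    job_type: str
    instance_key: int
    state: str = "CREATED"          # CREATED -> ACTIVATED -> COMPLETED


@dataclass
class Instance:
    key: int
    process_id: str
    state: str = "ACTIVE"
    variables: dict = field(default_factory=dict)
    remaining: list = field(default_factory=list)   # service-task types still to run


class FakeEngine:
    """Hypothetical in-memory stand-in for the REST API plus the engine loop."""

    def __init__(self):
        self._keys = count(1)
        self.definitions = {}   # process id -> ordered list of service-task types
        self.instances = {}
        self.jobs = {}

    def deploy(self, process_id, task_types):
        self.definitions[process_id] = list(task_types)

    def create_instance(self, process_id):
        pi = Instance(next(self._keys), process_id,
                      remaining=list(self.definitions[process_id]))
        self.instances[pi.key] = pi
        self._advance(pi)
        return pi

    def _advance(self, pi):
        # Create the job for the next service task, or complete the instance.
        if pi.remaining:
            job = Job(next(self._keys), pi.remaining.pop(0), pi.key)
            self.jobs[job.key] = job
        else:
            pi.state = "COMPLETED"

    def activate_jobs(self, job_type, max_jobs=1):
        # Only CREATED jobs are handed out, so a job never reaches two workers.
        picked = [j for j in self.jobs.values()
                  if j.job_type == job_type and j.state == "CREATED"][:max_jobs]
        for job in picked:
            job.state = "ACTIVATED"
        return picked

    def complete_job(self, job_key, variables):
        job = self.jobs[job_key]
        if job.state != "ACTIVATED":
            raise ValueError(f"job {job_key} is not activated")
        job.state = "COMPLETED"
        pi = self.instances[job.instance_key]
        pi.variables.update(variables)
        self._advance(pi)

    def get_instance(self, key):
        return self.instances[key]


engine = FakeEngine()
engine.deploy("hello", ["say-hello"])          # start -> "say-hello" -> end
pi = engine.create_instance("hello")
job = engine.activate_jobs("say-hello", max_jobs=1)[0]
engine.complete_job(job.key, {"message": "Hello, world!"})
assert engine.get_instance(pi.key).state == "COMPLETED"
```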
🎯 M1 reached: Skeleton works end-to-end.
Weeks 5-12: Full engine + Tasklist (M2 milestone)
Goal: Most BPMN features work. Tasklist functional.
Weeks 5-6: Gateways + Sequence Flows
Backend:
- [ ] Exclusive gateway processor (with CEL expressions per ADR-022)
- [ ] Parallel gateway processor (fork + join)
- [ ] Default sequence flow
- [ ] Condition evaluation (CEL integration)
- [ ] Variable scoping (per concepts/variable-scoping)
Tests:
- [ ] Property-based: random gateways + paths
Validation: A BPMN process with a gateway condition such as `amount > 100` takes the correct path.
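The exclusive-gateway behaviour targeted in weeks 5-6 can be sketched as follows. This is a minimal illustration under stated assumptions: plain Python callables stand in for compiled CEL expressions, and `SequenceFlow`, `take_exclusive`, and `NoFlowTaken` are hypothetical names, not the engine's API. The first flow whose condition holds is taken, otherwise the default flow is used.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class SequenceFlow:
    target: str
    # Stand-in for a compiled CEL expression (assumption for this sketch).
    condition: Optional[Callable[[dict], bool]] = None


class NoFlowTaken(Exception):
    """No condition matched and there is no default flow."""


def take_exclusive(flows, default, variables):
    """Return the target of the first flow whose condition is true, else the default."""
    for flow in flows:
        if flow.condition is not None and flow.condition(variables):
            return flow.target
    if default is not None:
        return default.target
    raise NoFlowTaken("exclusive gateway has no matching flow")


flows = [SequenceFlow("manual-approval", lambda v: v["amount"] > 100)]
default = SequenceFlow("auto-approve")

print(take_exclusive(flows, default, {"amount": 250}))   # manual-approval
print(take_exclusive(flows, default, {"amount": 40}))    # auto-approve
```

When no condition matches and no default exists, a real engine would most likely raise an incident; here that case is only signalled with an exception.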
Weeks 7-8: User tasks + Tasklist UI
Backend:
- [ ] User task processor (CREATE, ASSIGN, COMPLETE)
- [ ] User task search endpoint
- [ ] Claim atomic via UPDATE ... RETURNING
- [ ] Forms storage + retrieval
Frontend (Tasklist):
- [ ] "My tasks" list
- [ ] "Available tasks" list
- [ ] Task detail page
- [ ] Claim/unclaim actions
- [ ] JSON Schema form rendering (rjsf)
- [ ] Complete with variables
Validation: human approval flow works end-to-end.
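The atomic claim can be illustrated with a conditional UPDATE. The sketch below uses SQLite from the Python standard library so that it runs anywhere, and checks `rowcount` rather than using `RETURNING`; in Postgres the same `WHERE assignee IS NULL` guard combined with `RETURNING` would also hand back the claimed row. The table layout and the `claim` helper are assumptions made for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE user_tasks (task_key INTEGER PRIMARY KEY, state TEXT, assignee TEXT)"
)
conn.execute("INSERT INTO user_tasks VALUES (1, 'CREATED', NULL)")
conn.commit()


def claim(conn, task_key, user):
    """Claim a task only if nobody holds it; returns True when this caller won."""
    cur = conn.execute(
        "UPDATE user_tasks SET assignee = ?, state = 'ASSIGNED' "
        "WHERE task_key = ? AND assignee IS NULL",
        (user, task_key),
    )
    conn.commit()
    # Exactly one row changes for the winner; later callers match zero rows.
    return cur.rowcount == 1


print(claim(conn, 1, "alice"))  # True
print(claim(conn, 1, "bob"))    # False
```

Because the check and the write happen in one statement, two users clicking "claim" at the same time cannot both succeed.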
Weeks 9-10: Timers + Messages
Backend:
- [ ] Timer processor (per concepts/timer-recovery-postgres)
- [ ] DueDate scheduler (in-memory min-heap)
- [ ] Recovery scan on startup
- [ ] Message correlation (per concepts/message-correlation)
- [ ] Boundary timer events
- [ ] Intermediate timer events
- [ ] Message start events
- [ ] Message intermediate catch events
Validation: Process with "wait 1 hour then continue" works. Message from API correlates to waiting process.
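A due-date scheduler backed by a min-heap, plus the startup recovery scan, might look roughly like this. It is a sketch under assumptions: `DueDateScheduler` and its methods are hypothetical names, and the `rows` list stands in for pending rows read from the timers table, which remains the source of truth.

```python
import heapq
from datetime import datetime, timedelta, timezone


class DueDateScheduler:
    """In-memory min-heap of (due_date, timer_key) pairs."""

    def __init__(self):
        self._heap = []

    def schedule(self, timer_key, due_date):
        heapq.heappush(self._heap, (due_date, timer_key))

    def recover(self, persisted_timers):
        """Startup scan: reload every pending timer row, including overdue ones."""
        self._heap = [(due, key) for key, due in persisted_timers]
        heapq.heapify(self._heap)

    def pop_due(self, now):
        """Return keys of timers whose due date has passed, earliest first."""
        due = []
        while self._heap and self._heap[0][0] <= now:
            due.append(heapq.heappop(self._heap)[1])
        return due


t0 = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
rows = [(101, t0 + timedelta(hours=1)), (102, t0 - timedelta(minutes=5))]

scheduler = DueDateScheduler()
scheduler.recover(rows)                                # e.g. after an engine restart
print(scheduler.pop_due(t0))                           # [102], overdue while down
print(scheduler.pop_due(t0 + timedelta(minutes=59)))   # []
print(scheduler.pop_due(t0 + timedelta(hours=1)))      # [101]
```

Timers that came due while the engine was down fire on the first tick after recovery, which is the behaviour the "wait 1 hour then continue" validation depends on.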
Weeks 11-12: Incidents + Inspector (M2)
Backend:
- [ ] Incident creation on job_no_retries
- [ ] Incident resolution endpoint
- [ ] Incident on expression (CEL) evaluation failure
- [ ] Process Inspector API (search, filters)
Frontend (Process Inspector):
- [ ] Search by key/business ID
- [ ] List with filters
- [ ] Detail modal
- [ ] Variables modal
- [ ] Resolve incident
- [ ] Cancel instance
🎯 M2 reached: Engine covers 80% of BPMN. Tasklist + Inspector functional.
Weeks 13-20: Production hardening (M3 milestone)
Goal: Production-ready. Comprehensive testing. Observability.
Week 13: Backpressure + rate limiting
Backend:
- [ ] Per-tenant rate limiter (token bucket per concepts/backpressure-rest-strategy)
- [ ] Engine queue depth monitoring
- [ ] HTTP 429/503 responses
- [ ] Whitelisted critical commands
Validation: Load test sees rate limiting kick in at configured threshold.
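A per-tenant token bucket could be sketched as below. The class names, the rate and burst values, and the `200` success status are assumptions for illustration; the clock is passed in explicitly so the behaviour is reproducible in tests.

```python
class TokenBucket:
    def __init__(self, rate_per_sec, burst, now):
        self.rate = rate_per_sec
        self.capacity = burst
        self.tokens = float(burst)
        self.updated = now

    def allow(self, now):
        # Refill for the elapsed time, capped at the burst capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


class TenantRateLimiter:
    def __init__(self, rate_per_sec, burst):
        self.rate, self.burst = rate_per_sec, burst
        self.buckets = {}   # tenant id -> TokenBucket

    def check(self, tenant_id, now):
        """Return the HTTP status the API would answer with: 200 or 429."""
        bucket = self.buckets.setdefault(
            tenant_id, TokenBucket(self.rate, self.burst, now)
        )
        return 200 if bucket.allow(now) else 429


limiter = TenantRateLimiter(rate_per_sec=2, burst=3)
print([limiter.check("tenant-a", now=0.0) for _ in range(4)])   # [200, 200, 200, 429]
print(limiter.check("tenant-b", now=0.0))                        # 200, separate bucket
print(limiter.check("tenant-a", now=0.5))                        # 200, one token refilled
```

Whitelisted critical commands would bypass `check` entirely; the 503 response tied to engine queue depth is a separate signal and is not modelled here.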
Week 14: Process definition cache
Backend:
- [ ] LRU cache for ExecutableProcess (per concepts/process-definition-cache)
- [ ] Invalidation via LISTEN/NOTIFY
- [ ] Latest version cache (TTL)
- [ ] Cache metrics
Validation: Cache hit rate > 95% in load test.
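An LRU cache with hit-rate metrics can be sketched with `collections.OrderedDict`. `DefinitionCache` and its `loader` callback are hypothetical; in the real engine the loader would parse a stored definition into an ExecutableProcess, and `invalidate` would be driven by the LISTEN/NOTIFY handler.

```python
from collections import OrderedDict


class DefinitionCache:
    def __init__(self, capacity, loader):
        self.capacity = capacity
        self.loader = loader          # called on a miss
        self.entries = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, definition_key):
        if definition_key in self.entries:
            self.entries.move_to_end(definition_key)   # mark as most recently used
            self.hits += 1
            return self.entries[definition_key]
        self.misses += 1
        value = self.loader(definition_key)
        self.entries[definition_key] = value
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)           # evict least recently used
        return value

    def invalidate(self, definition_key):
        """Drop one entry, e.g. when a new version is deployed."""
        self.entries.pop(definition_key, None)

    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


cache = DefinitionCache(capacity=2, loader=lambda key: f"executable-{key}")
for key in [1, 1, 1, 2, 1, 3, 2]:
    cache.get(key)
print(list(cache.entries))          # [3, 2]
print(round(cache.hit_rate(), 2))   # 0.43
```

The low hit rate in this toy run comes from a capacity of 2 against 3 hot keys; the > 95% target assumes the cache is sized above the working set of active definitions.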
Week 15: Idempotency + worker SDK improvements
Backend:
- [ ] Idempotency keys (per concepts/api-engine-serialization)
- [ ] Activation count tracking
SDK:
- [ ] Worker SDK: idempotency helpers
- [ ] Polyglot: Python SDK
- [ ] Retry/backoff built-in
- [ ] Health checks
- [ ] OTel integration in SDK
Validation: Submitting the same request twice does not create a duplicate.
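The idempotency-key behaviour can be illustrated with an in-memory store. This is a sketch only: `IdempotentCommandApi`, the `(tenant_id, idempotency_key)` scoping, and the response shape are assumptions, and a Postgres-backed version would more likely rely on a unique constraint than on a dictionary.

```python
from itertools import count


class IdempotentCommandApi:
    def __init__(self):
        self._keys = count(1)
        self.instances = {}
        self.seen = {}     # (tenant_id, idempotency_key) -> stored response

    def create_instance(self, tenant_id, idempotency_key, process_id):
        dedupe_key = (tenant_id, idempotency_key)
        if dedupe_key in self.seen:
            return self.seen[dedupe_key]           # replay the first response
        instance_key = next(self._keys)
        self.instances[instance_key] = process_id
        response = {"processInstanceKey": instance_key}
        self.seen[dedupe_key] = response
        return response


api = IdempotentCommandApi()
first = api.create_instance("tenant-a", "req-42", "hello")
retry = api.create_instance("tenant-a", "req-42", "hello")   # client retry
other = api.create_instance("tenant-b", "req-42", "hello")   # same key, other tenant
print(first == retry, len(api.instances))                    # True 2
```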
Week 16: Observability complete
Platform:
- [ ] All business metrics emitted (per adrs/adr-011-opentelemetry-instrumentation)
- [ ] Distributed tracing setup
- [ ] Log structuring (JSON, no sensitive data per security model)
- [ ] Grafana dashboards (5 main per adrs/adr-009-skip-optimize-use-grafana)
- [ ] Alert rules (Prometheus Alertmanager)
Validation: Dashboard shows real-time TPS, latency, queue depth.
Week 17: Error handling + Edge cases
Backend:
- [ ] Error boundary events
- [ ] Throw error (worker → BPMN)
- [ ] Retry/backoff logic in JobFailProcessor
- [ ] Per-instance event count limit (anti-infinite-loop)
Validation: Process with error handler catches and routes correctly.
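The retry/backoff step of job failure handling might be sketched as follows. The function names, the dictionary-based job record, and the backoff parameters (base 1s, factor 2, cap 60s) are assumptions; only the `job_no_retries` incident reason comes from the plan above.

```python
def backoff_seconds(attempt, base=1.0, factor=2.0, cap=60.0):
    """Exponential backoff for the given 1-based attempt, capped."""
    return min(cap, base * factor ** (attempt - 1))


def on_job_failed(job, now):
    """Decrement retries; schedule a retry, or report an incident when none remain."""
    job["retries"] -= 1
    job["attempt"] += 1
    if job["retries"] <= 0:
        job["state"] = "FAILED"
        return {"type": "INCIDENT_CREATED", "reason": "job_no_retries", "jobKey": job["key"]}
    job["state"] = "RETRY_SCHEDULED"
    job["retry_at"] = now + backoff_seconds(job["attempt"])
    return {"type": "JOB_RETRY_SCHEDULED", "retryAt": job["retry_at"]}


job = {"key": 7, "retries": 3, "attempt": 0, "state": "ACTIVATED"}
print(on_job_failed(job, now=100.0))   # retry scheduled at 101.0
print(on_job_failed(job, now=101.0))   # retry scheduled at 103.0
print(on_job_failed(job, now=103.0))   # incident: job_no_retries
```

Passing `now` in, rather than reading the wall clock, keeps the function compatible with the replay determinism requirement.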
Week 18: CLI tool + admin operations
Tool:
- [ ] mvp-cli complete:
    - process-instances list/get/cancel/variables
    - incidents list/resolve
    - jobs list/cancel
    - tenants create/suspend/resume
    - api-keys create/list/rotate/delete
    - deploy
    - batch operations
Validation: All ops user stories doable via CLI.
Week 19: Storage management
Backend:
- [ ] pg_partman setup for command_log + event_log
- [ ] Retention policies per tenant
- [ ] Archival to S3 (cron job)
- [ ] Cleanup of old state
Validation: 30-day retention test (synthetic 30 days of data, verify archival).
Week 20: Comprehensive testing (M3)
QA:
- [ ] Load test 100 TPS sustained 4 hours
- [ ] Failover test (kill engine, verify recovery)
- [ ] Chaos test (network partitions, DB hiccups)
- [ ] Security test (per analysis/security-threat-model threats)
- [ ] Multi-tenant isolation test
- [ ] Replay determinism property tests (1000 iterations)
🎯 M3 reached: Production-ready Phase 0.
Weeks 21-26: Polish + ship (M4 milestone)
Goal: Ship to first user.
Week 21: Documentation
Docs (written by the engineers):
- [ ] Getting started guide
- [ ] BPMN supported subset reference
- [ ] API documentation (auto-generated from OpenAPI)
- [ ] Worker SDK guides (TS, Python, Go)
- [ ] Operations runbook (per analysis/failure-mode-analysis)
- [ ] Migration guide from Camunda 8
Week 22: BPMN compatibility testing
QA:
- [ ] Test corpus: 50+ real BPMN files
- [ ] Conversion tool: Camunda FEEL → CEL
- [ ] Document supported/unsupported elements
- [ ] Migration tool: import Camunda 8 process definitions
Validation: 80% of test corpus deploys + executes correctly.
Week 23: First user deployment
Platform:
- [ ] Production environment provisioned
- [ ] Production deployment automation
- [ ] Monitoring + alerts in production
- [ ] Backup strategy verified (pgBackRest → S3)
- [ ] DR drill (restore from backup, validate)
User onboarding:
- [ ] First user deploys their workflow
- [ ] Provide support
- [ ] Iterate on UX feedback
Week 24: Performance tuning
Backend (based on production observation):
- [ ] Identified slow queries → add indexes
- [ ] Cache hit rates → tune sizes
- [ ] Connection pool → adjust if needed
- [ ] Job activation latency → optimize query
Validation: TP99 < 1s sustained in production.
Week 25: Bug fixes + UX polish
All team:
- [ ] Bug bash week
- [ ] Polish UI (Tasklist + Inspector)
- [ ] Improve error messages
- [ ] Documentation gaps filled
Week 26: Ship (M4)
Launch:
- [ ] Public announcement / launch
- [ ] On-call rotation established
- [ ] Customer support channels ready
- [ ] Retrospective
🎯 M4 reached: MVP SHIPPED.
Weeks 27-32: Phase 1 HA (M5 milestone)
Goal: 99.9% SLA via HA.
Weeks 27-28: Patroni setup
Platform:
- [ ] Patroni + etcd cluster (per adrs/adr-020-patroni-postgres-ha)
- [ ] 3-node Postgres setup
- [ ] PgBouncer in front
- [ ] Failover testing
Week 29: Application changes for HA
Backend:
- [ ] Retry logic in DB client (handles failover)
- [ ] Connection string rotation
- [ ] Read replica reads for non-critical queries (Tasklist, Inspector)
- [ ] Lag monitoring
Week 30: HA testing
QA:
- [ ] Kill primary, verify failover < 30s
- [ ] Replication lag tests
- [ ] Split-brain prevention tests
- [ ] Test PITR recovery
Week 31: Monitoring + runbooks
Platform:
- [ ] Patroni dashboards
- [ ] Failover runbook
- [ ] DR plan documented + drilled
- [ ] On-call procedures
Week 32: Production HA rollout (M5)
Migration:
- [ ] Migrate production to Patroni cluster (zero-downtime)
- [ ] Validate SLA over 2 weeks
🎯 M5 reached: Phase 1 HA operational.
Critical path risks
Risk 1: BPMN parser complexity underestimated
Probability: High (BPMN spec is complex)
Impact: Slip by 2-4 weeks
Mitigation:
- Strict scope (only supported elements per ADR-001 subset)
- Reuse libraries (bpmn-go if Go, bpmn-server reference)
- Time-box parser work (max 3 weeks total)
Risk 2: Replay determinism bugs slow development
Probability: Medium
Impact: Slip by 2 weeks
Mitigation:
- Property tests from week 3 (early detection)
- Strict coding rules (no direct `time.Now()` calls)
- Code reviews enforce determinism
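The rule against direct wall-clock reads can be illustrated with an injected clock. In this hypothetical sketch (`FixedClock`, `process`, `replay`, and the `recorded_at_ms` field are invented names), each command carries the timestamp recorded when it was first written to the log, so replaying the log reproduces the same events regardless of when the replay runs.

```python
class FixedClock:
    """Clock whose value comes from the command being processed, never the wall clock."""

    def __init__(self, now_ms):
        self.now_ms = now_ms

    def now(self):
        return self.now_ms


def process(command, clock):
    """Pure function of (command, clock): the same inputs always yield the same events."""
    if command["type"] == "CREATE_TIMER":
        due = clock.now() + command["delay_ms"]
        return [{"type": "TIMER_CREATED", "due_ms": due}]
    return []


def replay(command_log):
    events = []
    for command in command_log:
        # The timestamp was recorded when the command was first appended to the log.
        events.extend(process(command, FixedClock(command["recorded_at_ms"])))
    return events


log = [{"type": "CREATE_TIMER", "delay_ms": 3_600_000, "recorded_at_ms": 1_700_000_000_000}]
assert replay(log) == replay(log)       # replays agree because no wall-clock read occurs
print(replay(log)[0]["due_ms"])         # 1700003600000
```

A property test would generate random command logs and assert the same equality, which is the kind of check the week 3 scaffolding is meant to host.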
Risk 3: Tasklist UX rework needed
Probability: Medium
Impact: Slip by 1-2 weeks
Mitigation:
- User testing in week 8 (catch issues early)
- Reuse design system / component library
- Defer advanced features to Phase 2
Risk 4: Performance not meeting target
Probability: Medium
Impact: 1-2 weeks tuning
Mitigation:
- Profile early (week 16)
- Cache decisions early (per ADR analysis)
- Engineer mindfully (per concepts/microbenchmark-methodology)
Risk 5: Security gaps discovered late
Probability: Medium
Impact: 1-2 weeks remediation
Mitigation:
- Threat model review in week 12 (M2)
- External pentest before M4 ship
- Security checklist enforced
Validation criteria per milestone
M1 (Week 4): Skeleton
- Engine runs locally via `make dev`
- One BPMN process deploys + executes
- Service task → worker → completion works
- State persists to Postgres
- Tests passing in CI
M2 (Week 12): Functional
- BPMN coverage: start/end events, service/user tasks, exclusive/parallel gateways, timers, messages
- Tasklist UI functional
- Process Inspector UI functional
- Incidents work
- 80% of test BPMN corpus passes
M3 (Week 20): Production-ready
- 100 TPS sustained 4h without degradation
- Failover (engine restart) < 30s
- TP99 < 1s
- All security threats mitigated (per threat model)
- Replay determinism tests passing
- Comprehensive observability
- DR plan documented
M4 (Week 26): Shipped
- First real user using in production
- No P0 bugs open
- Documentation complete
- Support channels established
M5 (Week 32): HA
- Postgres HA via Patroni
- 99.9% SLA achieved in monitoring
- DR drill successful
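To make the 99.9% target concrete, the allowed downtime can be computed for the 2-week validation window and compared with the < 30s failover goal. This is plain arithmetic; the function name and the choice of windows are illustrative.

```python
def downtime_budget_minutes(sla, window_days):
    """Allowed downtime in minutes for an availability target over a window."""
    return (1 - sla) * window_days * 24 * 60


two_weeks = downtime_budget_minutes(0.999, 14)
per_year = downtime_budget_minutes(0.999, 365)
print(round(two_weeks, 2))        # 20.16 minutes in the 2-week validation window
print(round(per_year / 60, 2))    # 8.76 hours per year

failover_seconds = 30
print(int(two_weeks * 60 // failover_seconds))   # 40 worst-case failovers fit in the window
```

A single slow recovery of more than about 20 minutes during the validation window would therefore break the target on its own.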
Effort estimates
| Phase | Engineer-weeks | Calendar weeks |
|---|---|---|
| Foundation (M1) | 8 | 4 |
| Core engine (M2) | 24 | 8 |
| Hardening (M3) | 24 | 8 |
| Ship (M4) | 18 | 6 |
| HA (M5) | 18 | 6 |
| Total to M4 | 74 | 26 |
| Total to M5 | 92 | 32 |
74 engineer-weeks / 3 engineers ≈ 25 calendar weeks (~6 months), which aligns with the 26-week M4 timeline and leaves about one week of buffer.
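The totals in the table can be re-derived with a few lines of arithmetic. The figures are copied from the table; the 2.5 FTE line is included only to show how sensitive the schedule is to the staffing assumption.

```python
phases = {  # phase -> (engineer-weeks, calendar weeks), copied from the table
    "Foundation (M1)": (8, 4),
    "Core engine (M2)": (24, 8),
    "Hardening (M3)": (24, 8),
    "Ship (M4)": (18, 6),
    "HA (M5)": (18, 6),
}

to_m4 = sum(ew for name, (ew, _) in phases.items() if name != "HA (M5)")
to_m5 = sum(ew for ew, _ in phases.values())
calendar_m4 = sum(cw for name, (_, cw) in phases.items() if name != "HA (M5)")

print(to_m4, to_m5, calendar_m4)   # 74 92 26
print(round(to_m4 / 3.0, 1))       # 24.7 calendar weeks at 3.0 FTE
print(round(to_m4 / 2.5, 1))       # 29.6 calendar weeks at 2.5 FTE

for name, (ew, cw) in phases.items():
    print(name, ew / cw)           # implied engineers working in parallel per phase
```

At 3.0 FTE the work fits the 26-week plan with roughly a week to spare; at 2.5 FTE it would overrun by more than three weeks.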
Beyond M5: roadmap
Phase 1 (M5) is just the beginning. Future phases:
- Phase 2 (months 9-12): Active-active engines, strict multi-tenancy, SDK polish, more BPMN elements
- Phase 3 (months 13-18): Tenant sharding strategy, SaaS deployment, billing
- Phase 4 (months 19+): Citus migration, multi-region, advanced features
Links
- analysis/blueprint-plataforma-simplificada — Detailed spec
- analysis/scaling-strategy-postgres — Phase 0 → Phase 6 roadmap
- adrs/index — Architectural decisions
- analysis/security-threat-model — Threats mitigated by week 20
- analysis/failure-mode-analysis — Failures mitigated by week 18-20
- analysis/self-review-critical-gaps — Gaps closed in weeks 13-15