
Concrete Implementation Roadmap

A concrete 26-week plan to build the MVP (Phase 0), plus 6 additional weeks for Phase 1 (HA). Each week has clear deliverables, validation criteria, and risk indicators. Team sizing: 1 backend engineer + 1 frontend engineer + 0.5 platform engineer = ~2.5 FTE of core engineering (3.0 FTE once the part-time tech lead and QA/SRE roles are counted). Total: ~6 months to a functional MVP, ~7.5 months to a production-ready Phase 1. Milestones: M1 (week 4) skeleton + first BPMN executes, M2 (week 12) full engine + Tasklist, M3 (week 20) production hardening, M4 (week 26) MVP ship, M5 (week 32) Phase 1 HA.

Assumed team composition

| Role | FTE | Responsibilities |
|---|---|---|
| Backend engineer (lead) | 1.0 | Engine core, API, processors |
| Frontend engineer | 1.0 | Tasklist, Process Inspector, CLI |
| Platform engineer | 0.5 | Infra, CI/CD, observability, deploy |
| Tech lead / architect | 0.25 | Reviews, decisions, planning |
| QA / SRE | 0.25 | Testing, runbooks, on-call |
| Total | 3.0 | |

A smaller team is possible (for example, a single backend FTE), at the cost of extending the timeline by roughly 50%.

Milestones overview

| Milestone | Week | Deliverable | Validation |
|---|---|---|---|
| M1 | 4 | Skeleton runs, simple BPMN executes | Hello-world process: start → service task → end |
| M2 | 12 | Full engine + Tasklist functional | 80% of the BPMN test suite passes |
| M3 | 20 | Production hardening complete | Load test: 100 TPS sustained 4h |
| M4 | 26 | MVP SHIPPED (Phase 0) | First customer / internal use |
| M5 | 32 | Phase 1 HA running | 99.9% SLA validated |

Detailed roadmap

Weeks 1-4: Foundation (M1 milestone)

Goal: Functional skeleton. The first BPMN process executes end-to-end.

Week 1: Setup + DB schema

Backend:
- [ ] Repo setup (Go + Bun/Vite frontend, structure per analysis/self-review-critical-gaps)
- [ ] Basic CI/CD (lint, test, build)
- [ ] Local Postgres setup (Docker Compose)
- [ ] Initial migrations:
  - tenants, users, user_tenants, api_keys
  - process_definitions, deployments
  - command_log, event_log (with pg_partman)
  - process_instances, element_instances, variables
  - jobs, user_tasks, incidents, timers

Platform:
- [ ] Infrastructure-as-code (Terraform/Pulumi)
- [ ] Dev/staging environments
- [ ] Observability setup (OTel collector → local Grafana)

Validation: pg_dump succeeds, migrations apply cleanly, and the engine can run a basic SELECT 1 against the database.

Week 2: API skeleton + Auth

Backend:
- [ ] REST framework setup (Gin/Fiber/Chi for Go)
- [ ] OpenAPI 3.0 spec started
- [ ] Auth middleware:
  - OIDC JWT validation
  - API key validation
  - Tenant resolution from token
- [ ] Endpoint stubs:
  - POST /v2/deployments
  - POST /v2/process-instances
  - GET /v2/process-instances/{key}
- [ ] Per-ADR-024: Postgres RLS policies activated
- [ ] Per-ADR-025: audit_log table + write helper

Frontend:
- [ ] React/Vue app skeleton
- [ ] Auth flow with Auth0/Keycloak local
- [ ] Layout (sidebar, header, routing)

Validation: curl -H "Authorization: Bearer <jwt>" /v2/process-instances returns 200 (empty list).

Week 3: Command log + simple processing

Backend:
- [ ] Command log INSERT in the API
- [ ] Engine processing loop (single thread, async)
- [ ] LISTEN/NOTIFY signal for new commands
- [ ] Simple BPMN parser (start/end/service task only)
- [ ] First processor: ProcessInstanceCreationCreateProcessor
- [ ] Event log writes (atomic with state)

Tests:
- [ ] Replay determinism test scaffolding (ADR-019)
- [ ] First property-based test (generate random simple BPMN, verify completion)

Validation:

mvp-cli deploy hello.bpmn
mvp-cli create-instance hello
# → Engine creates instance, persists to DB
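
The replay determinism scaffolding listed under Tests could start from something like the sketch below. It is only an illustration under assumptions: the event shapes, `apply_event`, `replay`, and `state_digest` are hypothetical names, not the engine's actual types. The idea is that state is a pure fold over the event log, so replaying the same log twice must produce the same digest.

```python
import hashlib
import json


def apply_event(state: dict, event: dict) -> dict:
    """Pure transition: builds a new state and never reads clocks or random sources."""
    instances = dict(state)
    if event["type"] == "INSTANCE_CREATED":
        # The timestamp comes from the recorded event, not from the wall clock.
        instances[event["key"]] = {"state": "ACTIVE", "created_at": event["ts"]}
    elif event["type"] == "INSTANCE_COMPLETED":
        instances[event["key"]] = {**instances[event["key"]], "state": "COMPLETED"}
    return instances


def replay(events: list[dict]) -> dict:
    """Rebuild state from scratch by folding over the event log in order."""
    state: dict = {}
    for event in events:
        state = apply_event(state, event)
    return state


def state_digest(state: dict) -> str:
    # Canonical JSON (sorted keys) so that equal states always hash equally.
    return hashlib.sha256(json.dumps(state, sort_keys=True).encode()).hexdigest()


# Hypothetical event log for one hello-world instance.
EVENT_LOG = [
    {"type": "INSTANCE_CREATED", "key": "1", "ts": 1000},
    {"type": "INSTANCE_COMPLETED", "key": "1", "ts": 1005},
]
```

A property-based test would generate random event logs and assert that `state_digest(replay(log))` is stable across runs.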

Week 4: Service task + workers (M1)

Backend:
- [ ] Service task processor
- [ ] Job creation on ACTIVATE_ELEMENT
- [ ] Jobs activation endpoint with SKIP LOCKED (per concepts/job-queue-fairness)
- [ ] Job complete/fail endpoints
- [ ] Element completion / next sequence flow

SDK:
- [ ] Worker SDK skeleton (Go + TypeScript)
- [ ] Hello-world example worker

Validation: end-to-end test:

# Hello world process
deploy('process.bpmn')  # start → service task "say-hello" → end

pi = create_instance('hello')

# Worker
job = activate_jobs('say-hello', maxJobs=1)[0]
complete_job(job.key, {"message": "Hello, world!"})

# Verify
assert get_instance(pi.key).state == 'COMPLETED'
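
The SKIP LOCKED job activation mentioned in the backend tasks might look roughly like the following. This is a sketch under assumptions: the `jobs` column names (`key`, `type`, `state`, `worker`) and the state values are hypothetical, and `conn` is any DB-API connection to Postgres that uses named `%(name)s` parameters (as psycopg does). Rows already locked by a concurrent poller are skipped rather than waited on, so workers do not block each other.

```python
# Hypothetical schema: jobs(key, type, state, worker).
ACTIVATE_JOBS_SQL = """
UPDATE jobs
   SET state = 'ACTIVATED', worker = %(worker)s
 WHERE key IN (
         SELECT key
           FROM jobs
          WHERE type = %(job_type)s
            AND state = 'ACTIVATABLE'
          ORDER BY key
          LIMIT %(max_jobs)s
            FOR UPDATE SKIP LOCKED
       )
RETURNING key, type
"""


def activate_jobs(conn, job_type: str, worker: str, max_jobs: int) -> list:
    """Claim up to max_jobs jobs of one type for a worker in a single statement."""
    if max_jobs < 1:
        raise ValueError("max_jobs must be >= 1")
    cur = conn.cursor()
    cur.execute(
        ACTIVATE_JOBS_SQL,
        {"job_type": job_type, "worker": worker, "max_jobs": max_jobs},
    )
    rows = cur.fetchall()  # one (key, type) tuple per claimed job
    conn.commit()
    return rows
```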

🎯 M1 reached: Skeleton works end-to-end.

Weeks 5-12: Full engine + Tasklist (M2 milestone)

Goal: Most BPMN features work. Tasklist functional.

Weeks 5-6: Gateways + Sequence Flows

Backend:
- [ ] Exclusive gateway processor (with CEL expressions per ADR-022)
- [ ] Parallel gateway processor (fork + join)
- [ ] Default sequence flow
- [ ] Condition evaluation (CEL integration)
- [ ] Variable scoping (per concepts/variable-scoping)

Tests:
- [ ] Property-based: random gateways + paths

Validation: a BPMN process with the gateway condition amount > 100 routes correctly.
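
The exclusive gateway and default-flow behavior above can be sketched as follows. This is an illustration only: conditions are plain Python callables standing in for compiled CEL expressions, and `SequenceFlow`, `choose_flow`, and `NoFlowTaken` are hypothetical names. The first flow whose condition holds is taken, otherwise the default flow, otherwise the engine would raise an incident.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class SequenceFlow:
    target: str
    # Stand-in for a compiled CEL expression; None marks an unconditional flow.
    condition: Optional[Callable[[dict], bool]] = None


class NoFlowTaken(Exception):
    """No condition matched and no default flow exists (incident candidate)."""


def choose_flow(
    flows: list[SequenceFlow],
    default: Optional[SequenceFlow],
    variables: dict,
) -> SequenceFlow:
    # Evaluate conditions in document order and take the first that holds.
    for flow in flows:
        if flow.condition is not None and flow.condition(variables):
            return flow
    if default is not None:
        return default
    raise NoFlowTaken("exclusive gateway has no matching flow")


flows = [SequenceFlow("manual-approval", lambda v: v["amount"] > 100)]
default_flow = SequenceFlow("auto-approve")
```

For example, `choose_flow(flows, default_flow, {"amount": 250})` selects the `manual-approval` flow, while an amount of exactly 100 falls through to the default.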

Weeks 7-8: User tasks + Tasklist UI

Backend:
- [ ] User task processor (CREATE, ASSIGN, COMPLETE)
- [ ] User task search endpoint
- [ ] Claim atomic via UPDATE ... RETURNING
- [ ] Forms storage + retrieval

Frontend (Tasklist):
- [ ] My tasks list
- [ ] Available tasks list
- [ ] Task detail page
- [ ] Claim/unclaim actions
- [ ] JSON Schema form rendering (rjsf)
- [ ] Complete with variables

Validation: human approval flow works end-to-end.
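
The atomic claim can be illustrated with a conditional UPDATE acting as a compare-and-set. The sketch below uses the standard-library `sqlite3` module and the cursor's `rowcount` so that it runs without Postgres; in Postgres the same statement would typically end with `RETURNING` to fetch the claimed row in one round trip. The table and column names are assumptions.

```python
import sqlite3


def claim_task(conn: sqlite3.Connection, task_key: int, user: str) -> bool:
    """Compare-and-set claim: succeeds only if the task is still unassigned."""
    cur = conn.execute(
        "UPDATE user_tasks SET assignee = ? "
        "WHERE task_key = ? AND assignee IS NULL",
        (user, task_key),
    )
    conn.commit()
    # Exactly one row changes for the winner; a later claimer changes none.
    return cur.rowcount == 1


# Hypothetical minimal table, in memory for the demonstration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE user_tasks (task_key INTEGER PRIMARY KEY, assignee TEXT)")
conn.execute("INSERT INTO user_tasks (task_key, assignee) VALUES (1, NULL)")
```

Because the check and the write happen in one statement, two users clicking Claim at the same time cannot both succeed.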

Weeks 9-10: Timers + Messages

Backend:
- [ ] Timer processor (per concepts/timer-recovery-postgres)
- [ ] DueDate scheduler (in-memory min-heap)
- [ ] Recovery scan on startup
- [ ] Message correlation (per concepts/message-correlation)
- [ ] Boundary timer events
- [ ] Intermediate timer events
- [ ] Message start events
- [ ] Message intermediate catch events

Validation: Process with "wait 1 hour then continue" works. Message from API correlates to waiting process.
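
A minimal sketch of the in-memory due-date scheduler with a startup recovery scan, assuming timers are persisted as `(timer_key, due_at)` pairs. The class and method names are hypothetical. The heap keeps the earliest due date at the front, so checking for due timers costs O(1) when nothing is due.

```python
import heapq


class DueDateScheduler:
    """In-memory min-heap of (due_at, timer_key); the earliest timer sits at index 0."""

    def __init__(self) -> None:
        self._heap: list[tuple[float, int]] = []

    def schedule(self, timer_key: int, due_at: float) -> None:
        heapq.heappush(self._heap, (due_at, timer_key))

    def recover(self, persisted_timers: list[tuple[int, float]]) -> None:
        # Startup scan: reload every pending timer from the durable table,
        # since the heap itself is lost when the engine restarts.
        for timer_key, due_at in persisted_timers:
            self.schedule(timer_key, due_at)

    def pop_due(self, now: float) -> list[int]:
        """Remove and return the keys of all timers with due_at <= now."""
        due = []
        while self._heap and self._heap[0][0] <= now:
            due.append(heapq.heappop(self._heap)[1])
        return due
```

The caller passes `now` explicitly, which keeps the scheduler testable without sleeping.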

Weeks 11-12: Incidents + Inspector (M2)

Backend:
- [ ] Incident creation on job_no_retries
- [ ] Incident resolution endpoint
- [ ] Incident on CEL evaluation failure
- [ ] Process Inspector API (search, filters)

Frontend (Process Inspector):
- [ ] Search by key/business ID
- [ ] List with filters
- [ ] Detail modal
- [ ] Variables modal
- [ ] Resolve incident
- [ ] Cancel instance

🎯 M2 reached: Engine passes 80% of the BPMN test corpus. Tasklist + Inspector functional.

Weeks 13-20: Production hardening (M3 milestone)

Goal: Production-ready. Comprehensive testing. Observability.

Week 13: Backpressure + rate limiting

Backend:
- [ ] Per-tenant rate limiter (token bucket per concepts/backpressure-rest-strategy)
- [ ] Engine queue depth monitoring
- [ ] HTTP 429/503 responses
- [ ] Whitelisted critical commands

Validation: Load test sees rate limiting kick in at configured threshold.
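
The per-tenant token bucket could be sketched like this. The class names and parameters are assumptions, and time is passed in explicitly so the behavior is deterministic in tests. Each tenant gets its own bucket; a request that finds the bucket empty would be answered with HTTP 429.

```python
class TokenBucket:
    """Refills at rate_per_sec up to burst; each request spends one token."""

    def __init__(self, rate_per_sec: float, burst: float, now: float) -> None:
        self.rate = rate_per_sec
        self.burst = burst
        self.tokens = burst
        self.updated_at = now

    def allow(self, now: float, cost: float = 1.0) -> bool:
        if now > self.updated_at:
            # Refill for the elapsed time, never exceeding the burst size.
            elapsed = now - self.updated_at
            self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
            self.updated_at = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


class TenantRateLimiter:
    """One independent bucket per tenant, created lazily on first request."""

    def __init__(self, rate_per_sec: float, burst: float) -> None:
        self.rate = rate_per_sec
        self.burst = burst
        self._buckets: dict[str, TokenBucket] = {}

    def allow(self, tenant_id: str, now: float) -> bool:
        bucket = self._buckets.get(tenant_id)
        if bucket is None:
            bucket = self._buckets[tenant_id] = TokenBucket(self.rate, self.burst, now)
        return bucket.allow(now)
```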

Week 14: Process definition cache

Backend:
- [ ] LRU cache for ExecutableProcess (per concepts/process-definition-cache)
- [ ] Invalidation via LISTEN/NOTIFY
- [ ] Latest version cache (TTL)
- [ ] Cache metrics

Validation: Cache hit rate > 95% in load test.
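
A minimal sketch of an LRU cache with explicit invalidation and hit-rate metrics, assuming definitions are looked up by an integer key and loaded through a caller-supplied `loader` (for example a database read plus BPMN parse). All names are hypothetical. In the real engine, `invalidate` would be triggered by a LISTEN/NOTIFY message.

```python
from collections import OrderedDict
from typing import Callable


class ProcessDefinitionCache:
    def __init__(self, capacity: int, loader: Callable[[int], object]) -> None:
        self._capacity = capacity
        self._loader = loader
        self._entries: OrderedDict[int, object] = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, definition_key: int) -> object:
        if definition_key in self._entries:
            self.hits += 1
            # Mark as most recently used.
            self._entries.move_to_end(definition_key)
            return self._entries[definition_key]
        self.misses += 1
        value = self._loader(definition_key)
        self._entries[definition_key] = value
        if len(self._entries) > self._capacity:
            # Evict the least recently used entry (front of the ordering).
            self._entries.popitem(last=False)
        return value

    def invalidate(self, definition_key: int) -> None:
        self._entries.pop(definition_key, None)

    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0
```

The `hit_rate` method is the number the > 95% validation target would be checked against.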

Week 15: Idempotency + worker SDK improvements

Backend:
- [ ] Idempotency keys (per concepts/api-engine-serialization)
- [ ] Activation count tracking

SDK:
- [ ] Worker SDK: idempotency helpers
- [ ] Polyglot: Python SDK
- [ ] Retry/backoff built-in
- [ ] Health checks
- [ ] OTel integration in SDK

Validation: Submitting the same request twice does not create a duplicate.
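
One way to get this behavior is a unique constraint on `(tenant_id, idempotency_key)`, so the database itself rejects the duplicate. The sketch below uses `sqlite3` and `INSERT OR IGNORE` so it runs standalone; Postgres would use `INSERT ... ON CONFLICT DO NOTHING` instead. The table layout and function name are assumptions, not the project's schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE process_instances (
        instance_key    INTEGER PRIMARY KEY AUTOINCREMENT,
        tenant_id       TEXT NOT NULL,
        idempotency_key TEXT NOT NULL,
        process_id      TEXT NOT NULL,
        UNIQUE (tenant_id, idempotency_key)
    )
    """
)


def create_instance(conn, tenant_id: str, idempotency_key: str, process_id: str) -> int:
    """Create the instance once; a retry with the same key returns the original."""
    conn.execute(
        "INSERT OR IGNORE INTO process_instances "
        "(tenant_id, idempotency_key, process_id) VALUES (?, ?, ?)",
        (tenant_id, idempotency_key, process_id),
    )
    conn.commit()
    row = conn.execute(
        "SELECT instance_key FROM process_instances "
        "WHERE tenant_id = ? AND idempotency_key = ?",
        (tenant_id, idempotency_key),
    ).fetchone()
    return row[0]
```

Scoping the key by tenant keeps one tenant's keys from colliding with another's.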

Week 16: Observability complete

Platform:
- [ ] All business metrics emitted (per adrs/adr-011-opentelemetry-instrumentation)
- [ ] Distributed tracing setup
- [ ] Log structuring (JSON, no sensitive data per security model)
- [ ] Grafana dashboards (5 main per adrs/adr-009-skip-optimize-use-grafana)
- [ ] Alert rules (Prometheus Alertmanager)

Validation: Dashboard shows real-time TPS, latency, queue depth.

Week 17: Error handling + Edge cases

Backend:
- [ ] Error boundary events
- [ ] Throw error (worker → BPMN)
- [ ] Retry/backoff logic in JobFailProcessor
- [ ] Per-instance event count limit (anti-infinite-loop)

Validation: Process with error handler catches and routes correctly.
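
The retry/backoff decision in a job-fail processor might follow the shape below. This is a sketch with assumed defaults (1 s base, 300 s cap) and hypothetical function names. It uses capped exponential backoff, and a job with no retries left becomes an incident instead of being retried.

```python
def next_retry_delay(
    attempt: int, base_seconds: float = 1.0, cap_seconds: float = 300.0
) -> float:
    """Capped exponential backoff: base * 2^(attempt - 1), at most cap_seconds."""
    if attempt < 1:
        raise ValueError("attempt is 1-based")
    return min(cap_seconds, base_seconds * 2 ** (attempt - 1))


def on_job_failed(retries_remaining: int, attempt: int) -> dict:
    """Decide what happens after a worker reports a failure."""
    if retries_remaining <= 0:
        # Nothing left to try: surface the failure to operators as an incident.
        return {"action": "RAISE_INCIDENT"}
    return {"action": "RETRY", "delay_seconds": next_retry_delay(attempt)}
```

Production code would usually add random jitter to the delay so that many failing jobs do not retry in lockstep.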

Week 18: CLI tool + admin operations

Tool:
- [ ] mvp-cli complete:
  - process-instances list/get/cancel/variables
  - incidents list/resolve
  - jobs list/cancel
  - tenants create/suspend/resume
  - api-keys create/list/rotate/delete
  - deploy
  - batch operations

Validation: All operations user stories can be completed via the CLI.

Week 19: Storage management

Backend:
- [ ] pg_partman setup for command_log + event_log
- [ ] Retention policies per tenant
- [ ] Archival to S3 (cron job)
- [ ] Cleanup of old state

Validation: 30-day retention test (synthetic 30 days of data, verify archival).

Week 20: Comprehensive testing (M3)

QA:
- [ ] Load test 100 TPS sustained 4 hours
- [ ] Failover test (kill engine, verify recovery)
- [ ] Chaos test (network partitions, DB hiccups)
- [ ] Security test (per analysis/security-threat-model threats)
- [ ] Multi-tenant isolation test
- [ ] Replay determinism property tests (1000 iterations)

🎯 M3 reached: Production-ready Phase 0.

Weeks 21-26: Polish + ship (M4 milestone)

Goal: Ship to first user.

Week 21: Documentation

Doc team (engineers writing):
- [ ] Getting started guide
- [ ] BPMN supported subset reference
- [ ] API documentation (auto-generated from OpenAPI)
- [ ] Worker SDK guides (TS, Python, Go)
- [ ] Operations runbook (per analysis/failure-mode-analysis)
- [ ] Migration guide from Camunda 8

Week 22: BPMN compatibility testing

QA:
- [ ] Test corpus: 50+ real BPMN files
- [ ] Conversion tool: Camunda FEEL → CEL
- [ ] Document supported/unsupported elements
- [ ] Migration tool: import Camunda 8 process definitions

Validation: 80% of test corpus deploys + executes correctly.

Week 23: First user deployment

Platform:
- [ ] Production environment provisioned
- [ ] Production deployment automation
- [ ] Monitoring + alerts in production
- [ ] Backup strategy verified (pgBackRest → S3)
- [ ] DR drill (restore from backup, validate)

User onboarding:
- [ ] First user deploys their workflow
- [ ] Provide support
- [ ] Iterate on UX feedback

Week 24: Performance tuning

Backend (based on production observation):
- [ ] Slow queries identified → add indexes
- [ ] Cache hit rates → tune sizes
- [ ] Connection pool → adjust if needed
- [ ] Job activation latency → optimize query

Validation: TP99 < 1s sustained in production.

Week 25: Bug fixes + UX polish

All team:
- [ ] Bug bash week
- [ ] Polish UI (Tasklist + Inspector)
- [ ] Improve error messages
- [ ] Documentation gaps filled

Week 26: Ship (M4)

Launch:
- [ ] Public announcement / launch
- [ ] On-call rotation established
- [ ] Customer support channels ready
- [ ] Retrospective

🎯 M4 reached: MVP SHIPPED.

Weeks 27-32: Phase 1 (HA) — M5 milestone

Goal: 99.9% SLA via HA.

Week 27-28: Patroni setup

Platform:
- [ ] Patroni + etcd cluster (per adrs/adr-020-patroni-postgres-ha)
- [ ] 3-node Postgres setup
- [ ] PgBouncer in front
- [ ] Failover testing

Week 29: Application changes for HA

Backend:
- [ ] Retry logic in DB client (handles failover)
- [ ] Connection string rotation
- [ ] Read replica reads for non-critical queries (Tasklist, Inspector)
- [ ] Lag monitoring

Week 30: HA testing

QA:
- [ ] Kill primary, verify failover < 30s
- [ ] Replication lag tests
- [ ] Split-brain prevention tests
- [ ] Test PITR recovery

Week 31: Monitoring + runbooks

Platform:
- [ ] Patroni dashboards
- [ ] Failover runbook
- [ ] DR plan documented + drilled
- [ ] On-call procedures

Week 32: Production HA rollout (M5)

Migration:
- [ ] Migrate production to Patroni cluster (zero-downtime)
- [ ] Validate SLA over 2 weeks

🎯 M5 reached: Phase 1 HA operational.

Critical path risks

Risk 1: BPMN parser complexity underestimated

Probability: High (BPMN spec is complex)
Impact: Slip by 2-4 weeks
Mitigation:
- Strict scope (only supported elements per ADR-001 subset)
- Reuse libraries (bpmn-go if Go, bpmn-server reference)
- Time-box parser work (max 3 weeks total)

Risk 2: Replay determinism bugs slow development

Probability: Medium
Impact: Slip by 2 weeks
Mitigation:
- Property tests from week 3 (early detection)
- Strict coding rules (no direct time.Now() calls)
- Code reviews enforce determinism
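
The rule against reading the wall clock directly can be illustrated with an injected clock. The sketch is in Python for brevity (the engine itself is planned in Go), and every name in it is hypothetical. Live processing reads the clock once and records the result in the event; replay rebuilds state from the recorded value only, so it gives the same answer no matter when it runs.

```python
import time
from typing import Callable

Clock = Callable[[], int]  # returns epoch milliseconds


def system_clock() -> int:
    """The only place that touches the wall clock; injected at the edge."""
    return int(time.time() * 1000)


def process_start_timer(command: dict, clock: Clock) -> dict:
    """Live processing: read the clock once and record the result in the event."""
    now_ms = clock()
    return {
        "type": "TIMER_CREATED",
        "timer_key": command["timer_key"],
        "due_at_ms": now_ms + command["delay_ms"],
    }


def apply_timer_created(state: dict, event: dict) -> dict:
    """Replay: state comes from the recorded due date, never from the clock."""
    return {**state, event["timer_key"]: event["due_at_ms"]}
```

Tests pass a fixed clock such as `lambda: 1_000`, while production wiring passes `system_clock`.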

Risk 3: Tasklist UX rework needed

Probability: Medium
Impact: Slip 1-2 weeks
Mitigation:
- User testing in week 8 (catch issues early)
- Reuse design system / component library
- Defer advanced features to Phase 2

Risk 4: Performance not meeting target

Probability: Medium
Impact: 1-2 weeks tuning
Mitigation:
- Profile early (week 16)
- Cache decisions early (per ADR analysis)
- Engineer mindfully (per concepts/microbenchmark-methodology)

Risk 5: Security gaps discovered late

Probability: Medium
Impact: 1-2 weeks remediation
Mitigation:
- Threat model review in week 12 (M2)
- External pentest before M4 ship
- Security checklist enforced

Validation criteria per milestone

M1 (Week 4) — Skeleton

  • Engine runs locally via make dev
  • One BPMN process deploys + executes
  • Service task → worker → completion works
  • State persists to Postgres
  • Tests passing in CI

M2 (Week 12) — Functional

  • BPMN coverage: start/end events, service/user tasks, exclusive/parallel gateways, timers, messages
  • Tasklist UI functional
  • Process Inspector UI functional
  • Incidents work
  • 80% of test BPMN corpus passes

M3 (Week 20) — Production-ready

  • 100 TPS sustained 4h without degradation
  • Failover (engine restart) < 30s
  • TP99 < 1s
  • All security threats mitigated (per threat model)
  • Replay determinism tests passing
  • Comprehensive observability
  • DR plan documented

M4 (Week 26) — Shipped

  • First real user using in production
  • No P0 bugs open
  • Documentation complete
  • Support channels established

M5 (Week 32) — HA

  • Postgres HA via Patroni
  • 99.9% SLA achieved in monitoring
  • DR drill successful

Effort estimates

| Phase | Engineer-weeks | Calendar weeks |
|---|---|---|
| Foundation (M1) | 8 | 4 |
| Core engine (M2) | 24 | 8 |
| Hardening (M3) | 24 | 8 |
| Ship (M4) | 18 | 6 |
| HA (M5) | 18 | 6 |
| Total to M4 | 74 | 26 |
| Total to M5 | 92 | 32 |

74 engineer-weeks / 3 engineers ≈ 25 calendar weeks (~6 months), which aligns with the 26-week M4 timeline and leaves a slight buffer.
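
The arithmetic behind these totals can be checked with a few lines. The figures are taken from the effort table; the variable names are illustrative only.

```python
# Engineer-weeks per phase, copied from the effort estimates table.
ENGINEER_WEEKS = {
    "Foundation (M1)": 8,
    "Core engine (M2)": 24,
    "Hardening (M3)": 24,
    "Ship (M4)": 18,
    "HA (M5)": 18,
}
TEAM_FTE = 3.0

total_to_m5 = sum(ENGINEER_WEEKS.values())
total_to_m4 = total_to_m5 - ENGINEER_WEEKS["HA (M5)"]

# Calendar time if the whole 3.0 FTE team works on the plan in parallel.
calendar_weeks_to_m4 = total_to_m4 / TEAM_FTE
```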

Beyond M5: roadmap

Phase 1 (M5) is just the beginning. Future phases:

  • Phase 2 (months 9-12): Active-active engines, strict multi-tenancy, SDK polish, more BPMN elements
  • Phase 3 (months 13-18): Tenant sharding strategy, SaaS deployment, billing
  • Phase 4 (months 19+): Citus migration, multi-region, advanced features