ADR-010: Hybrid monitoring: APM + minimal Process Inspector¶
- Status: Accepted
- Date: 2026-05-14
- Tags: webapps, monitoring, build-vs-buy
- Supersedes: Initial spec for full Operate (~10K LOC)
Context and Problem Statement¶
The initial plan was to build the complete Operate (~10K LOC over 3 sprints). A closer analysis showed that Operate is roughly 30% system monitoring and 70% business process state inspection. APM tools (Dynatrace, Datadog, Grafana, New Relic) cover that 30% better than any custom build. Do we build full Operate, or go hybrid?
Decision Drivers¶
- Specialized APM tools are superior at monitoring (ML anomaly detection, distributed tracing)
- Only the 70% of Operate that deals with business state requires a custom build
- Cost: APM SaaS is $1-10K/year vs. $30-55K to build and maintain
- OpenTelemetry allows swapping vendors (no lock-in)
- CLI + minimal UI cover business operations without full Operate
Considered Options¶
- Hybrid: APM + minimal Process Inspector + CLI (recommended)
- Build full Operate (~10K LOC, original spec)
- APM only (no Process Inspector)
- CLI only (no UI)
- Build complete Operate at Camunda parity (~50K LOC with all features)
Decision Outcome¶
Chosen option: Hybrid approach, because:
- APM covers system monitoring better than any custom build
- A minimal Process Inspector (~2K LOC) covers 80% of the business state UI needs
- The CLI covers power users and batch operations
- OpenTelemetry instrumentation keeps the setup vendor-agnostic
- It saves ~7.5K LOC and ~$48K in year 1
Positive Consequences¶
- 70% LOC reduction vs. building full Operate
- Better monitoring quality (Dynatrace ML > custom code)
- Distributed tracing for free (no build needed)
- Vendor-agnostic via OpenTelemetry
- Permanently lower maintenance burden
- Saves ~$48K in year 1, ~$24K/year ongoing
Negative Consequences¶
- Ongoing APM licensing cost ($1-10K/year)
- No built-in BPMN visual with state overlay (Phase 2 if needed)
- No real-time auto-refresh in the Inspector (manual refresh is acceptable)
- Multiple tools (UI + CLI + APM) require coordination in the docs
Pros and Cons of the Options¶
Hybrid (APM + Inspector + CLI)¶
Pros:
- Best monitoring quality
- Minimal custom LOC
- Vendor-agnostic
- Cost effective
- Distributed tracing free
- ML anomaly detection (free with APM)
Cons:
- APM cost ongoing
- Multiple tools to learn
- No BPMN visual in Inspector
Build Operate full¶
Pros:
- Single tool
- Own branding
- Full control
Cons:
- ~10K LOC initial
- Maintenance burden
- Inferior to APM at system monitoring
- No ML / distributed tracing
APM only¶
Pros:
- Zero custom LOC
- Best monitoring
Cons:
- Cannot mutate state (resolve incident, cancel)
- No BPMN context
- Business users do not want to use an APM UI
CLI only¶
Pros:
- Minimal build (~500 LOC)
- Power users stay productive
Cons:
- Business operators do not use a CLI
- No discoverability features
Full-parity Operate¶
Pros:
- Feature parity with Camunda
Cons:
- ~50K LOC investment
- ~12 months build time
- Still behind Camunda on features
- Throughput impact (export pipeline)
Architecture of the hybrid approach¶
flowchart TD
subgraph Users
SRE[SREs / DevOps]
BIZ[Business Ops]
PWR[Power users<br/>CI/CD, scripts]
end
APM["APM (buy)<br/>$1-10K/y"]
INS["Process Inspector<br/>(2K LOC)"]
CLI["CLI (build)<br/>(500 LOC)"]
API[Engine REST API]
ENG["Engine + PostgreSQL<br/>• OpenTelemetry<br/>• Metrics<br/>• Logs"]
SRE --> APM
BIZ --> INS
PWR --> CLI
INS --> API
CLI --> API
API --> ENG
ENG -.->|"otel traces / metrics / logs"| APM
Components of the hybrid¶
1. APM tool (BUY)¶
Stack options:
- Dynatrace: best ML anomaly detection
- Datadog: most popular; generous free tier, but eventually expensive
- Grafana Cloud: generous free tier, pay-per-use
- New Relic: established APM
- Honeycomb: observability-first, great traces
- Self-hosted: Prometheus + Grafana + Tempo + Loki (free, more ops work)
Typical cost: $1-10K/year at MVP scale.
Covers:
- Infrastructure (CPU, memory, disk, network)
- DB performance (query times, connections, locks)
- Application metrics (TPS, latency p50/p90/p99)
- Distributed tracing (gateway → engine → worker → external)
- Log aggregation + search
- Alerts + anomaly detection
- Dashboards
2. Process Inspector (BUILD, ~2K LOC)¶
Minimal UI for business operators:
┌─────────────────────────────────────────┐
│ 🔍 [Search: instance key / business ID] │
│ │
│ Quick filters: [Active] [Incidents] │
│ │
│ Custom filter: │
│ Process: [_______ ▼] │
│ State: [Active ▼] │
│ │
│ Results: │
│ ┌─────────────────────────────────────┐ │
│ │ #12345 | order-approval | ACTIVE │ │
│ │ Current: Manager Approval (bob) │ │
│ │ [Details] [Variables] [Cancel] │ │
│ └─────────────────────────────────────┘ │
└─────────────────────────────────────────┘
Features:
- Search + filter
- Details modal (current elements, tree path)
- Variables modal (JSON tree)
- Resolve incident button
- Cancel instance button
Out of scope:
- ❌ BPMN visual viewer
- ❌ Auto-refresh
- ❌ Batch operations UI (use the CLI)
- ❌ Modification / Migration
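The search box and the two dropdowns in the mockup boil down to three optional predicates. The following is a minimal in-memory sketch of that logic; `InstanceSummary`, its fields, and `filter_instances` are hypothetical names for illustration, and a real Inspector would more likely pass these filters to the Engine REST API than filter in the UI.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class InstanceSummary:
    """Hypothetical row shape for the Inspector result list."""
    key: int
    process_id: str
    state: str  # e.g. "ACTIVE", "INCIDENT"
    business_id: Optional[str] = None


def filter_instances(
    rows: Iterable[InstanceSummary],
    query: Optional[str] = None,
    process_id: Optional[str] = None,
    state: Optional[str] = None,
) -> List[InstanceSummary]:
    """Apply the search box and both dropdown filters; None means 'any'."""
    result = []
    for row in rows:
        # The search box matches either the instance key or the business ID.
        if query is not None and query not in (str(row.key), row.business_id):
            continue
        if process_id is not None and row.process_id != process_id:
            continue
        if state is not None and row.state != state:
            continue
        result.append(row)
    return result


rows = [
    InstanceSummary(12345, "order-approval", "ACTIVE", "ORD-1"),
    InstanceSummary(12346, "order-approval", "INCIDENT", "ORD-2"),
    InstanceSummary(12347, "invoice", "ACTIVE"),
]
active_orders = filter_instances(rows, process_id="order-approval", state="ACTIVE")
print([r.key for r in active_orders])  # [12345]
```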
3. CLI ops (BUILD, ~500 LOC)¶
For power users, automation, and CI/CD:
# Listing
mvp-cli process-instances list --state=INCIDENT --since=24h
# Inspection
mvp-cli process-instances get 12345
mvp-cli process-instances variables 12345
# Operations
mvp-cli process-instances cancel 12345 --reason="duplicate"
mvp-cli incidents resolve 9876
# Batch (advanced)
mvp-cli process-instances list --state=INCIDENT --format=json \
| jq -r '.[] | .processInstanceKey' \
| xargs -I {} mvp-cli incidents resolve --pi {}
DevOps teams love this. Business users use the UI.
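For CI/CD jobs that prefer a script over a shell pipeline, the batch step can be sketched in Python. This assumes, as the jq filter in the batch example implies, that `--format=json` prints a JSON array of objects with a `processInstanceKey` field; `build_resolve_commands` is a hypothetical helper, and the sketch only builds the command lines rather than running them.

```python
import json
from typing import List


def build_resolve_commands(list_output: str) -> List[List[str]]:
    """Turn `process-instances list --format=json` output into argv lists.

    Assumption: the output is a JSON array whose items carry a
    `processInstanceKey` field.
    """
    instances = json.loads(list_output)
    return [
        ["mvp-cli", "incidents", "resolve", "--pi", str(item["processInstanceKey"])]
        for item in instances
    ]


sample = '[{"processInstanceKey": 12345}, {"processInstanceKey": 12346}]'
for argv in build_resolve_commands(sample):
    # A real script would hand each argv to subprocess.run(argv, check=True).
    print(" ".join(argv))
```

Keeping command construction separate from execution makes the batch logic testable without a running engine.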
4. OpenTelemetry instrumentation (BUILD, ~500 LOC)¶
The engine emits metrics, traces, and logs via the OTel SDK:
from opentelemetry import metrics, trace

# Meter for the business metrics emitted by the engine
meter = metrics.get_meter("workflow-engine")
process_instances_created = meter.create_counter(
    "process_instances.created"
)
processing_latency = meter.create_histogram(
    "processing.latency", unit="ms"
)

tracer = trace.get_tracer("workflow-engine")

# Per command processing: one span per command
def process_command(intent, tenant_id):
    with tracer.start_as_current_span("process_command") as span:
        span.set_attribute("command.intent", intent)
        span.set_attribute("tenant.id", tenant_id)
        # ... process
Configure exporter via env vars:
# Dynatrace
OTEL_EXPORTER_OTLP_ENDPOINT=https://yourtenant.live.dynatrace.com/api/v2/otlp
# Datadog
OTEL_EXPORTER_OTLP_ENDPOINT=http://datadog-agent:4317
# Grafana Cloud
OTEL_EXPORTER_OTLP_ENDPOINT=https://otlp-gateway-prod.grafana.net/otlp
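Switching APM vendors means changing env vars, not code: the endpoint and, where the vendor requires authentication, credentials such as OTEL_EXPORTER_OTLP_HEADERS. No vendor lock-in.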
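When choosing the endpoint value it helps to know how an OTLP/HTTP exporter derives per-signal URLs from it. The sketch below mirrors the rule in the OpenTelemetry exporter specification: a signal-specific variable is used as-is, otherwise `v1/<signal>` is appended to the base endpoint. The function name and the example host are illustrative assumptions, and gRPC exporters do not append a path.

```python
import os
from typing import Mapping, Optional

DEFAULT_BASE = "http://localhost:4318"  # OTLP/HTTP default in the OTel spec
SIGNAL_PATHS = {"traces": "v1/traces", "metrics": "v1/metrics", "logs": "v1/logs"}


def resolve_otlp_http_endpoint(
    signal: str, env: Optional[Mapping[str, str]] = None
) -> str:
    """Return the URL an OTLP/HTTP exporter would send `signal` to."""
    env = os.environ if env is None else env
    # A signal-specific variable wins and is used exactly as given.
    specific = env.get(f"OTEL_EXPORTER_OTLP_{signal.upper()}_ENDPOINT")
    if specific:
        return specific
    # Otherwise the per-signal path is appended to the base endpoint.
    base = env.get("OTEL_EXPORTER_OTLP_ENDPOINT", DEFAULT_BASE)
    return base.rstrip("/") + "/" + SIGNAL_PATHS[signal]


# Hypothetical collector URL, for illustration only.
env = {"OTEL_EXPORTER_OTLP_ENDPOINT": "https://collector.example.com/otlp"}
print(resolve_otlp_http_endpoint("traces", env))
# https://collector.example.com/otlp/v1/traces
```

A practical consequence is that the base endpoint should not already end in `/v1`, or the resolved URL would contain the segment twice.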
Key business-specific metrics¶
Beyond standard system metrics:
| Metric | Type | Labels |
|---|---|---|
| process_instances.created | counter | tenant, bpmn_process_id |
| process_instances.completed | counter | tenant, bpmn_process_id, outcome |
| process_instances.canceled | counter | tenant, reason |
| process_instances.duration | histogram | tenant, bpmn_process_id |
| process_instances.active.count | gauge | tenant, bpmn_process_id |
| jobs.activated | counter | tenant, job_type |
| jobs.completed | counter | tenant, job_type, retries |
| jobs.failed | counter | tenant, job_type, error_type |
| jobs.duration | histogram | tenant, job_type |
| incidents.created | counter | tenant, error_type |
| incidents.resolved | counter | tenant, error_type |
| incidents.active.count | gauge | tenant |
| commands.processed | counter | intent, tenant |
| commands.processing.duration | histogram | intent |
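One way to keep emitted labels aligned with this table is to transcribe it into a small catalog and check attributes before recording. The sketch below covers only three rows and uses hypothetical names (`METRIC_LABELS`, `check_labels`); it is an illustration of the idea, not part of the OpenTelemetry API.

```python
from typing import Dict, FrozenSet, Mapping

# A few rows of the table, transcribed as metric name -> allowed label keys.
METRIC_LABELS: Dict[str, FrozenSet[str]] = {
    "process_instances.created": frozenset({"tenant", "bpmn_process_id"}),
    "jobs.failed": frozenset({"tenant", "job_type", "error_type"}),
    "incidents.active.count": frozenset({"tenant"}),
}


def check_labels(metric: str, attributes: Mapping[str, str]) -> None:
    """Raise ValueError if `attributes` does not match the declared label set."""
    if metric not in METRIC_LABELS:
        raise ValueError(f"unknown metric: {metric}")
    expected = METRIC_LABELS[metric]
    actual = frozenset(attributes)
    if actual != expected:
        missing = sorted(expected - actual)
        extra = sorted(actual - expected)
        raise ValueError(f"{metric}: missing={missing} extra={extra}")


# Matches the table row for jobs.failed, so this call passes silently.
check_labels(
    "jobs.failed",
    {"tenant": "acme", "job_type": "email", "error_type": "timeout"},
)
```

A check like this could run in tests or in a debug build to catch stray high-cardinality labels before they reach the APM bill.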
Cost comparison¶
For a team of 3 devs over 1 year:
Build Operate full:
Dev time: 3 sprints × 2 devs ≈ $30K
Maintenance: ~15% time of 1 dev = $25K/year
Total year 1: ~$55K
Total year 2+: ~$25K/year
Hybrid:
Dev: 2 weeks × 1 dev = ~$5K
APM license (5 hosts × $20/host × 12 months): ~$1.2K/year
Total year 1: ~$6.2K
Total year 2+: ~$1.2K/year + minor maintenance
Saving year 1: ~$48.8K
Saving recurring: ~$24K/year
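The arithmetic behind these totals can be reproduced with a few lines of Python, which also makes it easy to rerun with different host counts or prices. The break-even figure at the end is an illustrative extra under simplifying assumptions: purely per-host pricing and no maintenance cost on the hybrid side.

```python
import math

# Figures in USD, copied from the comparison above.
BUILD_DEV = 30_000             # 3 sprints x 2 devs
BUILD_MAINT_PER_YEAR = 25_000  # ~15% of one dev
HYBRID_DEV = 5_000             # 2 weeks x 1 dev
HOSTS = 5
PRICE_PER_HOST_MONTH = 20


def apm_license_per_year(hosts: int, price_per_host_month: int) -> int:
    """Yearly APM license cost under simple per-host pricing."""
    return hosts * price_per_host_month * 12


apm = apm_license_per_year(HOSTS, PRICE_PER_HOST_MONTH)
build_year1 = BUILD_DEV + BUILD_MAINT_PER_YEAR
hybrid_year1 = HYBRID_DEV + apm
saving_year1 = build_year1 - hybrid_year1
saving_recurring = BUILD_MAINT_PER_YEAR - apm

# Host count at which the license alone exceeds the build's yearly maintenance.
break_even_hosts = math.ceil(BUILD_MAINT_PER_YEAR / (PRICE_PER_HOST_MONTH * 12))

print(apm, build_year1, hybrid_year1, saving_year1, saving_recurring)
# 1200 55000 6200 48800 23800
print(break_even_hosts)  # 105
```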
When to build full Operate anyway¶
4 cases where hybrid does NOT work:
- Compliance airgap: data cannot leave for a SaaS APM
- BPMN visual is critical for users
- Extreme volume: APM cost becomes prohibitive
- White-label SaaS: branding requires a custom build
For 95% of cases, none of these applies.
Links¶
- analysis/operate-vs-apm-tools: Full analysis
- analysis/operate-tasklist-mvp-detailed: Original spec (deprecated in favor of hybrid)
- adrs/adr-009-skip-optimize-use-grafana: Complementary
- adrs/adr-011-opentelemetry-instrumentation: Technical foundation
- OpenTelemetry