
ADR-010: Hybrid monitoring: APM + minimal Process Inspector

  • Status: Accepted
  • Date: 2026-05-14
  • Tags: webapps, monitoring, build-vs-buy
  • Supersedes: Initial spec for full Operate (~10K LOC)

Context and Problem Statement

The initial plan was to build Operate in full (~10K LOC over 3 sprints). A closer analysis showed that Operate is roughly 30% system monitoring and 70% business process state inspection, and that APM tools (Dynatrace, Datadog, Grafana, New Relic) cover that 30% better than any custom build would. Do we build full Operate, or go hybrid?

Decision Drivers

  • Specialized APM tools are superior at monitoring (ML anomaly detection, distributed tracing)
  • Only the 70% of Operate that deals with business state requires a custom build
  • Cost: APM SaaS runs $1-10K/year vs $30-55K to build and maintain
  • OpenTelemetry allows swapping vendors (no lock-in)
  • CLI + minimal UI cover business operations without full Operate

Considered Options

  1. Hybrid: APM + minimal Process Inspector + CLI (recommended)
  2. Build full Operate (~10K LOC, original spec)
  3. APM only (no Process Inspector)
  4. CLI only (no UI)
  5. Build Operate with full Camunda parity (~50K LOC with all features)

Decision Outcome

Chosen option: Hybrid approach, because:

  • APM covers system monitoring better than any custom build
  • A minimal Process Inspector (~2K LOC) covers 80% of the business state UI needs
  • The CLI covers power users and batch operations
  • OpenTelemetry instrumentation keeps the stack vendor-agnostic
  • It saves ~7.5K LOC and ~$48K in year 1

Positive Consequences

  • 70% LOC reduction vs building full Operate
  • Better monitoring quality (Dynatrace ML > custom code)
  • Distributed tracing for free (no build needed)
  • Vendor-agnostic via OpenTelemetry
  • Permanently lower maintenance burden
  • Saves ~$48K in year 1, ~$24K/year ongoing

Negative Consequences

  • Ongoing APM licensing cost ($1-10K/year)
  • No built-in BPMN visual with state overlay (Phase 2 if needed)
  • No real-time auto-refresh in the Inspector (manual refresh is acceptable)
  • Multiple tools (UI + CLI + APM) require coordination in the docs

Pros and Cons of the Options

Hybrid (APM + Inspector + CLI)

Pros:

  • Best monitoring quality
  • Minimal custom LOC
  • Vendor-agnostic
  • Cost effective
  • Distributed tracing for free
  • ML anomaly detection (free with APM)

Cons:

  • Ongoing APM cost
  • Multiple tools to learn
  • No BPMN visual in the Inspector

Build Operate full

Pros:

  • Single tool
  • Own branding
  • Full control

Cons:

  • ~10K LOC initial build
  • Maintenance burden
  • Inferior to APM at system monitoring
  • No ML or distributed tracing

APM only

Pros:

  • Zero custom LOC
  • Best monitoring

Cons:

  • Cannot mutate state (resolve incident, cancel)
  • No BPMN context
  • Business users do not want to use an APM UI

CLI only

Pros:

  • Minimal build (~500 LOC)
  • Power users stay productive

Cons:

  • Business operators do not use a CLI
  • No discoverability features

Operate with full parity

Pros:

  • Feature parity with Camunda

Cons:

  • ~50K LOC investment
  • ~12 months of build time
  • Still behind Camunda on features
  • Throughput impact (export pipeline)

Architecture of the hybrid approach

flowchart TD
    subgraph Users
        SRE[SREs / DevOps]
        BIZ[Business Ops]
        PWR[Power users<br/>CI/CD, scripts]
    end
    APM["APM (buy)<br/>$1-10K/y"]
    INS["Process Inspector<br/>(2K LOC)"]
    CLI["CLI (build)<br/>(500 LOC)"]
    API[Engine REST API]
    ENG["Engine + PostgreSQL<br/>• OpenTelemetry<br/>• Metrics<br/>• Logs"]
    SRE --> APM
    BIZ --> INS
    PWR --> CLI
    INS --> API
    CLI --> API
    API --> ENG
    ENG -.->|"otel traces / metrics / logs"| APM

Components of the hybrid approach

1. APM tool (BUY)

Stack options:

  • Dynatrace: best ML anomaly detection
  • Datadog: the most popular; generous free tier, but eventually expensive
  • Grafana Cloud: generous free tier, pay-per-use
  • New Relic: established APM
  • Honeycomb: observability-first, great traces
  • Self-hosted: Prometheus + Grafana + Tempo + Loki (free, but more ops work)

Typical cost: $1-10K/year at MVP scale.

Covers:

  • Infrastructure (CPU, memory, disk, network)
  • DB performance (query times, connections, locks)
  • Application metrics (TPS, latency p50/p90/p99)
  • Distributed tracing (gateway → engine → worker → external)
  • Log aggregation + search
  • Alerts + anomaly detection
  • Dashboards

2. Process Inspector (BUILD, ~2K LOC)

Minimal UI for business operators:

┌─────────────────────────────────────────┐
│ 🔍 [Search: instance key / business ID] │
│                                          │
│ Quick filters: [Active] [Incidents]      │
│                                          │
│ Custom filter:                           │
│ Process: [_______ ▼]                     │
│ State:   [Active ▼]                      │
│                                          │
│ Results:                                 │
│ ┌─────────────────────────────────────┐ │
│ │ #12345 | order-approval | ACTIVE    │ │
│ │ Current: Manager Approval (bob)     │ │
│ │ [Details] [Variables] [Cancel]      │ │
│ └─────────────────────────────────────┘ │
└─────────────────────────────────────────┘

Features:

  • Search + filter
  • Detail modal (current elements, tree path)
  • Variables modal (JSON tree)
  • Resolve incident button
  • Cancel instance button

Not included:

  • ❌ BPMN visual viewer
  • ❌ Auto-refresh
  • ❌ Batch operations UI (use the CLI)
  • ❌ Modification / Migration
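The search and filter behaviour in the mockup above can be sketched as a pure function over instance records. This is only an illustration under stated assumptions: the `Instance` fields, `filter_instances`, and the sample data are hypothetical and not taken from the engine's REST API.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Instance:
    # Hypothetical record shape; the real REST API payload may differ.
    key: int
    business_id: str
    process_id: str
    state: str  # e.g. "ACTIVE" or "INCIDENT"


def filter_instances(
    instances: Iterable[Instance],
    search: Optional[str] = None,
    process_id: Optional[str] = None,
    state: Optional[str] = None,
) -> list[Instance]:
    """Apply the search box and the two dropdown filters; None means 'any'."""
    result = []
    for inst in instances:
        # The search box matches either the instance key or the business ID.
        if search is not None and search not in (str(inst.key), inst.business_id):
            continue
        if process_id is not None and inst.process_id != process_id:
            continue
        if state is not None and inst.state != state:
            continue
        result.append(inst)
    return result


SAMPLE = [
    Instance(12345, "ORD-1", "order-approval", "ACTIVE"),
    Instance(12346, "ORD-2", "order-approval", "INCIDENT"),
    Instance(12347, "INV-9", "invoice-check", "ACTIVE"),
]

active_orders = filter_instances(SAMPLE, process_id="order-approval", state="ACTIVE")
print([inst.key for inst in active_orders])
```

Keeping the filter logic free of UI and HTTP concerns would let the Inspector and the CLI share it, should both end up filtering client-side.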

3. CLI ops (BUILD, ~500 LOC)

For power users, automation, and CI/CD:

# Listing
mvp-cli process-instances list --state=INCIDENT --since=24h

# Inspection
mvp-cli process-instances get 12345
mvp-cli process-instances variables 12345

# Operations
mvp-cli process-instances cancel 12345 --reason="duplicate"
mvp-cli incidents resolve 9876

# Batch (advanced)
mvp-cli process-instances list --state=INCIDENT --format=json \
  | jq -r '.[] | .processInstanceKey' \
  | xargs -I {} mvp-cli incidents resolve --pi {}

DevOps engineers love this. Business users use the UI.
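For teams that prefer a script to a shell pipeline, the batch step above could be sketched in Python roughly as follows. It assumes, as the `jq` filter does, that `list --format=json` prints a JSON array of objects with a `processInstanceKey` field. `build_resolve_commands` is a hypothetical helper, and the commands are only built here, not executed.

```python
import json


def build_resolve_commands(listing_json: str) -> list[list[str]]:
    """Turn the JSON listing into one `incidents resolve --pi KEY` argv per instance."""
    instances = json.loads(listing_json)
    return [
        ["mvp-cli", "incidents", "resolve", "--pi", str(item["processInstanceKey"])]
        for item in instances
    ]


# Sample payload standing in for the output of:
#   mvp-cli process-instances list --state=INCIDENT --format=json
sample = '[{"processInstanceKey": 12345}, {"processInstanceKey": 12346}]'

for argv in build_resolve_commands(sample):
    # To actually run each command: subprocess.run(argv, check=True)
    print(" ".join(argv))
```

Building argv lists rather than shell strings avoids quoting problems if the commands are later passed to `subprocess.run`.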

4. OpenTelemetry instrumentation (BUILD, ~500 LOC)

The engine emits metrics, traces, and logs via the OTel SDK:

from opentelemetry import metrics, trace

# Instruments are created once at startup and reused for every measurement.
meter = metrics.get_meter("workflow-engine")
process_instances_created = meter.create_counter(
    "process_instances.created"
)
processing_latency = meter.create_histogram(
    "processing.latency", unit="ms"
)

tracer = trace.get_tracer("workflow-engine")


def process_command(intent: str, tenant_id: str) -> None:
    # One span per processed command
    with tracer.start_as_current_span("process_command") as span:
        span.set_attribute("command.intent", intent)
        span.set_attribute("tenant.id", tenant_id)
        # ... process

Configure exporter via env vars:

# Dynatrace
OTEL_EXPORTER_OTLP_ENDPOINT=https://yourtenant.live.dynatrace.com/api/v2/otlp

# Datadog
OTEL_EXPORTER_OTLP_ENDPOINT=http://datadog-agent:4317

# Grafana Cloud  
OTEL_EXPORTER_OTLP_ENDPOINT=https://otlp-gateway-prod.grafana.net/otlp

Switching APM vendors means changing env vars. Zero vendor lock-in.
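One detail worth keeping in mind when swapping endpoints: for OTLP over HTTP, the OpenTelemetry specification treats `OTEL_EXPORTER_OTLP_ENDPOINT` as a base URL and appends `v1/traces`, `v1/metrics`, or `v1/logs` per signal, while gRPC endpoints are used as-is. A minimal sketch of that rule, where `signal_url` is an illustrative helper and not part of the OTel SDK:

```python
SIGNALS = ("traces", "metrics", "logs")


def signal_url(base_endpoint: str, signal: str) -> str:
    """Derive the per-signal OTLP/HTTP URL from the base endpoint."""
    if signal not in SIGNALS:
        raise ValueError(f"unknown signal: {signal}")
    # Avoid a double slash when the base already ends with "/".
    return f"{base_endpoint.rstrip('/')}/v1/{signal}"


base = "https://otlp-gateway-prod.grafana.net/otlp"
for signal in SIGNALS:
    print(signal_url(base, signal))
```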

Key business-specific metrics

Beyond standard system metrics:

Metric                          Type       Labels
process_instances.created       counter    tenant, bpmn_process_id
process_instances.completed     counter    tenant, bpmn_process_id, outcome
process_instances.canceled      counter    tenant, reason
process_instances.duration      histogram  tenant, bpmn_process_id
process_instances.active.count  gauge      tenant, bpmn_process_id
jobs.activated                  counter    tenant, job_type
jobs.completed                  counter    tenant, job_type, retries
jobs.failed                     counter    tenant, job_type, error_type
jobs.duration                   histogram  tenant, job_type
incidents.created               counter    tenant, error_type
incidents.resolved              counter    tenant, error_type
incidents.active.count          gauge      tenant
commands.processed              counter    intent, tenant
commands.processing.duration    histogram  intent
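A catalogue like this is easier to keep consistent when the allowed labels are checked in code. The sketch below is one possible approach, not the engine's implementation: `METRIC_LABELS` copies a few rows of the table, and `check_labels` is a hypothetical helper that rejects unexpected labels before a measurement is recorded, which also guards against accidental high-cardinality labels.

```python
# A few rows from the table above: metric name -> allowed label names.
METRIC_LABELS = {
    "process_instances.created": {"tenant", "bpmn_process_id"},
    "jobs.failed": {"tenant", "job_type", "error_type"},
    "incidents.active.count": {"tenant"},
}


def check_labels(metric: str, labels: dict[str, str]) -> dict[str, str]:
    """Return the labels unchanged if they match the catalogue, else raise."""
    allowed = METRIC_LABELS.get(metric)
    if allowed is None:
        raise KeyError(f"metric not in catalogue: {metric}")
    unexpected = set(labels) - allowed
    if unexpected:
        raise ValueError(f"unexpected labels for {metric}: {sorted(unexpected)}")
    return labels


print(check_labels("jobs.failed", {"tenant": "acme", "job_type": "send-email"}))
```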

Cost comparison

For a team of 3 devs over 1 year:

Build Operate full:
  Dev time: 3 sprints × 2 devs ≈ $30K
  Maintenance: ~15% of 1 dev's time = $25K/year
  Total year 1: ~$55K
  Total year 2+: ~$25K/year

Hybrid:
  Dev: 2 weeks × 1 dev = ~$5K
  APM license (5 hosts × $20/host × 12 months): ~$1.2K/year
  Total year 1: ~$6.2K
  Total year 2+: ~$1.2K/year + minor maintenance

Saving year 1: ~$48.8K
Saving recurring: ~$24K/year
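The arithmetic behind these figures can be reproduced with a few lines of Python. All inputs are the estimates stated above; the variable names are only for illustration.

```python
# Build Operate full (estimates from the comparison above, in USD)
build_dev = 30_000            # 3 sprints x 2 devs
build_maintenance = 25_000    # per year
build_year1 = build_dev + build_maintenance

# Hybrid
hybrid_dev = 5_000            # 2 weeks x 1 dev
apm_license = 5 * 20 * 12     # 5 hosts x $20/host x 12 months
hybrid_year1 = hybrid_dev + apm_license

saving_year1 = build_year1 - hybrid_year1
# Recurring saving ignores the "minor maintenance" of the hybrid setup.
saving_recurring = build_maintenance - apm_license

print(f"Build year 1:     ${build_year1:,}")
print(f"Hybrid year 1:    ${hybrid_year1:,}")
print(f"Saving year 1:    ${saving_year1:,}")
print(f"Saving recurring: ${saving_recurring:,}/year")
```

The recurring saving comes out at $23,800, which the comparison rounds to ~$24K/year.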

When building full Operate DOES make sense

Four cases where the hybrid approach does NOT work:

  1. Compliance air gap: data cannot leave for a SaaS APM
  2. BPMN visual is critical for users
  3. Extreme volume: APM cost becomes prohibitive
  4. White-label SaaS: branding requires a custom build

For 95% of cases, none of these applies.