Saltar a contenido

ADR-009: Skip Optimize, usar Grafana + SQL ad-hoc

  • Status: Accepted
  • Date: 2026-05-14
  • Tags: webapps, analytics, build-vs-buy

Context and Problem Statement

Camunda Optimize provee process analytics y BI (branch analysis, bottleneck detection, KPI dashboards). ~300K LOC. Cuando hay 25-50% throughput reduction si está habilitado. ¿Construimos analytics equivalent o lo skippeamos?

Decision Drivers

  • Optimize es 300K+ LOC (~mismo size que TODO el MVP)
  • Throughput impact 25-50% es significativo
  • SQL ad-hoc cubre 80% de queries analytics típicas
  • Grafana es free + maduro + popular
  • PostgreSQL state es directamente queryable

Considered Options

  1. Skip Optimize, usar SQL + Grafana (recomendado)
  2. Build Optimize-equivalent simplified (~50K LOC subset)
  3. Build Optimize-equivalent full (~300K LOC, paridad)
  4. Integrate con tools especializados (Celonis, ProcessGold, Disco)
  5. OLAP separado (export a ClickHouse/BigQuery)

Decision Outcome

Chosen option: Skip Optimize, usar Grafana + SQL ad-hoc porque: - 80% de queries analytics se resuelven con SQL aggregates simples - Grafana provee dashboards profesionales sin build - Cero throughput impact (queries van a read replicas) - MVP ship JSON dashboards predefinidos - Reconsiderar solo si negocio pide process mining explicitly

Positive Consequences

  • 300K LOC ahorrados (massive)
  • Sin throughput penalty del export pipeline a Optimize
  • Grafana dashboards versionables en git (GitOps)
  • SQL queries explorables ad-hoc (responden preguntas nuevas)
  • PostgreSQL existing tooling (psql, pgAdmin) cubre power users
  • Cero servicio adicional a operar

Negative Consequences

  • Sin process mining real (path discovery, conformance)
  • Sin branch analysis automático
  • Sin outlier detection ML-based
  • SQL skill requerido para queries ad-hoc
  • Custom dashboards requieren conocer schema
  • Cliente puede esperar feature paridad con Camunda

Pros and Cons of the Options

Skip + Grafana

Pros: - Cero LOC propio - Throughput impact zero - Tool maduro - GitOps compatible - SQL skill ubiquitous

Cons: - Sin process mining - Sin branch analysis automático - SQL skill required

Build simplified

Pros: - Brand propio - Subset cubrir 50% de Optimize value

Cons: - Aún 50K LOC inversión - Inferior a Optimize maduro - Maintenance burden

Build full paridad

Pros: - Feature parity

Cons: - ~300K LOC = ~3 años eng - Inferior al original probablemente - Throughput impact replicado - Massive opportunity cost

Tools especializados

Pros: - Best-in-class process mining - ML-based insights

Cons: - Expensive ($$$$/año licensing) - Integration work - Vendor lock-in - Overkill para early users

OLAP separado

Pros: - Analytics performance superior - No impacto en engine

Cons: - Pipeline de export adicional - Otro store a mantener - Costo

Lo que SQL + Grafana cubre

Query examples

-- Instances iniciadas por hora (últimos 7 días)
SELECT 
    date_trunc('hour', start_date) AS hour,
    COUNT(*) AS instances_started
FROM process_instances
WHERE tenant_id = $1
  AND start_date >= NOW() - INTERVAL '7 days'
GROUP BY hour
ORDER BY hour;

-- Tasa de completion por proceso
SELECT 
    bpmn_process_id,
    COUNT(*) FILTER (WHERE state = 'COMPLETED') AS completed,
    COUNT(*) FILTER (WHERE state = 'CANCELED') AS canceled,
    COUNT(*) FILTER (WHERE state = 'INCIDENT') AS incidents,
    AVG(EXTRACT(EPOCH FROM (end_date - start_date))) FILTER (WHERE state = 'COMPLETED') AS avg_duration_sec
FROM process_instances
WHERE tenant_id = $1
  AND start_date >= NOW() - INTERVAL '30 days'
GROUP BY bpmn_process_id;

-- Bottleneck detection (elementos que toman más tiempo)
SELECT 
    pi.bpmn_process_id,
    ei.element_id,
    ei.element_type,
    COUNT(*) AS executions,
    AVG(EXTRACT(EPOCH FROM (ei.completed_at - ei.activated_at))) AS avg_seconds,
    PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY EXTRACT(EPOCH FROM (ei.completed_at - ei.activated_at))) AS p50,
    PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY EXTRACT(EPOCH FROM (ei.completed_at - ei.activated_at))) AS p95
FROM element_instances ei
JOIN process_instances pi ON pi.process_instance_key = ei.process_instance_key
WHERE pi.tenant_id = $1
  AND ei.completed_at IS NOT NULL
  AND ei.activated_at >= NOW() - INTERVAL '7 days'
GROUP BY pi.bpmn_process_id, ei.element_id, ei.element_type
HAVING COUNT(*) > 10
ORDER BY avg_seconds DESC
LIMIT 20;

-- Branch analysis (gateway decisions)
SELECT 
    pi.bpmn_process_id,
    ei.element_id AS gateway,
    sf.target_element_id AS taken_path,
    COUNT(*) AS times_taken,
    100.0 * COUNT(*) / SUM(COUNT(*)) OVER (PARTITION BY pi.bpmn_process_id, ei.element_id) AS percentage
FROM element_instances ei
JOIN sequence_flows_taken sf ON sf.source_element_instance_key = ei.element_instance_key
JOIN process_instances pi ON pi.process_instance_key = ei.process_instance_key
WHERE ei.element_type = 'EXCLUSIVE_GATEWAY'
  AND ei.activated_at >= NOW() - INTERVAL '30 days'
GROUP BY pi.bpmn_process_id, ei.element_id, sf.target_element_id
ORDER BY pi.bpmn_process_id, ei.element_id, percentage DESC;

Grafana dashboards

MVP ships 5 dashboards JSON preconfigurados:

  1. Platform Health — engine TPS, latency, errors, infra
  2. Process Throughput — instances by process, completion rates
  3. Worker Performance — jobs metrics, durations, failures
  4. Incident Tracking — incident rate, resolution time, top errors
  5. Postgres Performance — slow queries, cache hit, connections

User imports JSON to Grafana → instant monitoring.

Lo que NO cubre (cuando build SÍ)

Process mining real

Detection de paths actuales vs path modelado (conformance checking):

Modelo BPMN:
  start → check_credit → if good: approve / else: reject

Realidad observada:
  85%: start → check_credit → approve  ✓
  10%: start → check_credit → reject   ✓
  5%:  start → check_credit → approve → cancel  ⚠ desviación

SQL puro no detecta este 5% sin lógica custom. Process mining tools como Celonis sí.

Outlier detection ML-based

"Esta instance se tomó 10x el promedio — anomaly".

SQL puede detectar outliers estadísticos. ML detecta patterns más sutiles. Para anomaly detection: - Phase 1: queries SQL con percentiles - Phase 2 (si needed): integrate con APM ML (Dynatrace) — recordar ADR-010

Predictive analytics

"Basado en current state, ETA de completion = X horas". Requires modelos ML entrenados.

Para MVP: NO. Defer until business demands.

Trigger para reconsiderar build

Construir analytics SI:

  1. Cliente paga $$ específicamente por process mining
  2. Compliance requires conformance checking documentado
  3. Optimize features son top-3 reasons for choosing platform
  4. 50% de support tickets piden "give me analytics like Optimize"

Si esto ocurre, probablemente comprar Celonis es mejor que build.

Stack recomendado

flowchart TD
    PG[(PostgreSQL state store)]
    GF[Grafana datasource: Postgres]
    DB[Dashboards predefinidos JSON]
    U["Users:<br/>• Default dashboards → 80% queries<br/>• Custom queries → power users SQL<br/>• Saved queries / views → repeat usage"]
    PG -->|"read replicas en Phase 1+"| GF
    GF --> DB
    DB --> U

Cost: $0 (Grafana OSS) o ~$50-200/mes (Grafana Cloud).