ADR-009: Skip Optimize, usar Grafana + SQL ad-hoc¶
- Status: Accepted
- Date: 2026-05-14
- Tags: webapps, analytics, build-vs-buy
Context and Problem Statement¶
Camunda Optimize provee process analytics y BI (branch analysis, bottleneck detection, KPI dashboards). ~300K LOC. Cuando hay 25-50% throughput reduction si está habilitado. ¿Construimos analytics equivalent o lo skippeamos?
Decision Drivers¶
- Optimize es 300K+ LOC (~mismo size que TODO el MVP)
- Throughput impact 25-50% es significativo
- SQL ad-hoc cubre 80% de queries analytics típicas
- Grafana es free + maduro + popular
- PostgreSQL state es directamente queryable
Considered Options¶
- Skip Optimize, usar SQL + Grafana (recomendado)
- Build Optimize-equivalent simplified (~50K LOC subset)
- Build Optimize-equivalent full (~300K LOC, paridad)
- Integrate con tools especializados (Celonis, ProcessGold, Disco)
- OLAP separado (export a ClickHouse/BigQuery)
Decision Outcome¶
Chosen option: Skip Optimize, usar Grafana + SQL ad-hoc porque: - 80% de queries analytics se resuelven con SQL aggregates simples - Grafana provee dashboards profesionales sin build - Cero throughput impact (queries van a read replicas) - MVP ship JSON dashboards predefinidos - Reconsiderar solo si negocio pide process mining explicitly
Positive Consequences¶
- 300K LOC ahorrados (massive)
- Sin throughput penalty del export pipeline a Optimize
- Grafana dashboards versionables en git (GitOps)
- SQL queries explorables ad-hoc (responden preguntas nuevas)
- PostgreSQL existing tooling (psql, pgAdmin) cubre power users
- Cero servicio adicional a operar
Negative Consequences¶
- Sin process mining real (path discovery, conformance)
- Sin branch analysis automático
- Sin outlier detection ML-based
- SQL skill requerido para queries ad-hoc
- Custom dashboards requieren conocer schema
- Cliente puede esperar feature paridad con Camunda
Pros and Cons of the Options¶
Skip + Grafana¶
Pros: - Cero LOC propio - Throughput impact zero - Tool maduro - GitOps compatible - SQL skill ubiquitous
Cons: - Sin process mining - Sin branch analysis automático - SQL skill required
Build simplified¶
Pros: - Brand propio - Subset cubrir 50% de Optimize value
Cons: - Aún 50K LOC inversión - Inferior a Optimize maduro - Maintenance burden
Build full paridad¶
Pros: - Feature parity
Cons: - ~300K LOC = ~3 años eng - Inferior al original probablemente - Throughput impact replicado - Massive opportunity cost
Tools especializados¶
Pros: - Best-in-class process mining - ML-based insights
Cons: - Expensive ($$$$/año licensing) - Integration work - Vendor lock-in - Overkill para early users
OLAP separado¶
Pros: - Analytics performance superior - No impacto en engine
Cons: - Pipeline de export adicional - Otro store a mantener - Costo
Lo que SQL + Grafana cubre¶
Query examples¶
-- Instances iniciadas por hora (últimos 7 días)
SELECT
date_trunc('hour', start_date) AS hour,
COUNT(*) AS instances_started
FROM process_instances
WHERE tenant_id = $1
AND start_date >= NOW() - INTERVAL '7 days'
GROUP BY hour
ORDER BY hour;
-- Tasa de completion por proceso
SELECT
bpmn_process_id,
COUNT(*) FILTER (WHERE state = 'COMPLETED') AS completed,
COUNT(*) FILTER (WHERE state = 'CANCELED') AS canceled,
COUNT(*) FILTER (WHERE state = 'INCIDENT') AS incidents,
AVG(EXTRACT(EPOCH FROM (end_date - start_date))) FILTER (WHERE state = 'COMPLETED') AS avg_duration_sec
FROM process_instances
WHERE tenant_id = $1
AND start_date >= NOW() - INTERVAL '30 days'
GROUP BY bpmn_process_id;
-- Bottleneck detection (elementos que toman más tiempo)
SELECT
pi.bpmn_process_id,
ei.element_id,
ei.element_type,
COUNT(*) AS executions,
AVG(EXTRACT(EPOCH FROM (ei.completed_at - ei.activated_at))) AS avg_seconds,
PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY EXTRACT(EPOCH FROM (ei.completed_at - ei.activated_at))) AS p50,
PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY EXTRACT(EPOCH FROM (ei.completed_at - ei.activated_at))) AS p95
FROM element_instances ei
JOIN process_instances pi ON pi.process_instance_key = ei.process_instance_key
WHERE pi.tenant_id = $1
AND ei.completed_at IS NOT NULL
AND ei.activated_at >= NOW() - INTERVAL '7 days'
GROUP BY pi.bpmn_process_id, ei.element_id, ei.element_type
HAVING COUNT(*) > 10
ORDER BY avg_seconds DESC
LIMIT 20;
-- Branch analysis (gateway decisions)
SELECT
pi.bpmn_process_id,
ei.element_id AS gateway,
sf.target_element_id AS taken_path,
COUNT(*) AS times_taken,
100.0 * COUNT(*) / SUM(COUNT(*)) OVER (PARTITION BY pi.bpmn_process_id, ei.element_id) AS percentage
FROM element_instances ei
JOIN sequence_flows_taken sf ON sf.source_element_instance_key = ei.element_instance_key
JOIN process_instances pi ON pi.process_instance_key = ei.process_instance_key
WHERE ei.element_type = 'EXCLUSIVE_GATEWAY'
AND ei.activated_at >= NOW() - INTERVAL '30 days'
GROUP BY pi.bpmn_process_id, ei.element_id, sf.target_element_id
ORDER BY pi.bpmn_process_id, ei.element_id, percentage DESC;
Grafana dashboards¶
MVP ships 5 dashboards JSON preconfigurados:
- Platform Health — engine TPS, latency, errors, infra
- Process Throughput — instances by process, completion rates
- Worker Performance — jobs metrics, durations, failures
- Incident Tracking — incident rate, resolution time, top errors
- Postgres Performance — slow queries, cache hit, connections
User imports JSON to Grafana → instant monitoring.
Lo que NO cubre (cuando build SÍ)¶
Process mining real¶
Detection de paths actuales vs path modelado (conformance checking):
Modelo BPMN:
start → check_credit → if good: approve / else: reject
Realidad observada:
85%: start → check_credit → approve ✓
10%: start → check_credit → reject ✓
5%: start → check_credit → approve → cancel ⚠ desviación
SQL puro no detecta este 5% sin lógica custom. Process mining tools como Celonis sí.
Outlier detection ML-based¶
"Esta instance se tomó 10x el promedio — anomaly".
SQL puede detectar outliers estadísticos. ML detecta patterns más sutiles. Para anomaly detection: - Phase 1: queries SQL con percentiles - Phase 2 (si needed): integrate con APM ML (Dynatrace) — recordar ADR-010
Predictive analytics¶
"Basado en current state, ETA de completion = X horas". Requires modelos ML entrenados.
Para MVP: NO. Defer until business demands.
Trigger para reconsiderar build¶
Construir analytics SI:
- Cliente paga $$ específicamente por process mining
- Compliance requires conformance checking documentado
- Optimize features son top-3 reasons for choosing platform
-
50% de support tickets piden "give me analytics like Optimize"
Si esto ocurre, probablemente comprar Celonis es mejor que build.
Stack recomendado¶
flowchart TD
PG[(PostgreSQL state store)]
GF[Grafana datasource: Postgres]
DB[Dashboards predefinidos JSON]
U["Users:<br/>• Default dashboards → 80% queries<br/>• Custom queries → power users SQL<br/>• Saved queries / views → repeat usage"]
PG -->|"read replicas en Phase 1+"| GF
GF --> DB
DB --> U
Cost: $0 (Grafana OSS) o ~$50-200/mes (Grafana Cloud).
Links¶
- entities/optimize — Lo que Optimize hace
- analysis/webapps-architecture-mvp — Estrategia webapps
- adrs/adr-010-hybrid-monitoring-apm-inspector — Complementario
- concepts/postgres-monitoring — Queries operacionales