Performance testing methodology
How to measure engine performance without fooling yourself: representative synthetic workloads, the metrics that matter, tooling (k6, vegeta, pgbench), baseline tracking, and capacity planning.
Anti-pattern: the "benchmark show"
Many engines publish marketing benchmarks that do not reproduce production conditions:
- "1M jobs/sec!" with empty tasks (no real service tasks).
- p99 < 1ms... measured in one region, with no GC pressure and no replication.
- Concurrency 1 = a single goroutine.
Our discipline: benchmarks that reflect reality.
Reference workloads
Workload W1: "Order Flow" (typical)
flowchart LR
S([StartEvent])
V[ValidateOrder<br/>service task<br/>50ms p50]
C[ChargePayment<br/>service task<br/>200ms p50, 1% error rate]
Sh[ShipOrder<br/>service task<br/>30ms p50]
E([EndEvent])
S --> V --> C --> Sh --> E
Three service tasks, no complex gateways. Represents 60% of the processes in production.
Workload W2: "Approval Chain"
flowchart LR
S([StartEvent])
CR[CreateRequest<br/>service]
MA[ManagerApproval<br/>user task<br/>p50=2h, very high variance]
DA[DirectorApproval<br/>user task]
EA[ExecuteAction<br/>service]
E([EndEvent])
BT{boundary timer 24h}
Esc[escalate]
S --> CR --> MA --> DA --> EA --> E
MA -.-> BT -.-> Esc
User tasks plus timers. Low throughput, long-lived instances (days to weeks).
Workload W3: "Saga Multi-Step"
flowchart TD
SP[Subprocess 'booking']
BF[BookFlight - service]
BH[BookHotel - service]
BC[BookCar - service]
OE[On error: compensate<br/>LIFO 3 service tasks]
SP --> BF
SP --> BH
SP --> BC
SP -.-> OE
Compensation and subprocesses. Stress test for the compensation engine.
Workload W4: "High Fan-Out"
flowchart LR
S([StartEvent])
PGS{{Parallel Gateway<br/>split into 50 branches}}
ST[50x ServiceTask<br/>parallel]
PGM{{Parallel Gateway<br/>merge}}
E([EndEvent])
S --> PGS --> ST --> PGM --> E
Stress test for parallel tokens and joins.
Workload W5: "Long Variable Payload"
flowchart LR
S([StartEvent])
ST1[ServiceTask<br/>variable: JSON 100KB]
ST2[ServiceTask<br/>variable: JSON 100KB merged]
E([EndEvent])
S --> ST1 --> ST2 --> E
Stress test for variable storage (TOAST, JSONB indexes).
Workload W6: "Multi-Tenant Mix"
Realistic mix. Measures fairness, RLS overhead, and partition contention.
Performance targets (M1-M4)
| Workload | Métrica | M1 target | M2 target | M3 target |
|---|---|---|---|---|
| W1 single instance | E2E latency p50 | 300ms | 250ms | 200ms |
| W1 single instance | E2E latency p99 | 1s | 800ms | 600ms |
| W1 throughput | instances/sec | 100 | 500 | 2000 |
| W1 throughput | commands/sec | 500 | 2500 | 10000 |
| W4 fan-out | join latency | 500ms | 300ms | 200ms |
| W5 big vars | E2E latency | 800ms | 600ms | 500ms |
| W6 mixed | error rate | 0.01% | 0.01% | 0.005% |
| Engine CPU @ target | utilization | < 70% | < 60% | < 50% |
| Postgres CPU @ target | utilization | < 70% | < 60% | < 50% |
Any change that regresses any metric by more than 10% blocks the merge.
Tool stack
k6: primary load-testing tool
Scripts are written in JavaScript and executed by a Go runtime. Excellent for HTTP APIs.
// load/w1-order-flow.js
import http from 'k6/http';
import { check } from 'k6';

export const options = {
  scenarios: {
    sustained: {
      executor: 'constant-arrival-rate',
      rate: 100, // 100 iterations (one request each) per second
      timeUnit: '1s',
      duration: '10m',
      preAllocatedVUs: 50,
      maxVUs: 200,
    },
  },
  thresholds: {
    'http_req_duration{name:start_instance}': ['p(99)<500'],
    'http_req_failed': ['rate<0.01'],
  },
};

const BASE = __ENV.WF_BASE_URL || 'http://localhost:8080';
const TOKEN = __ENV.WF_TOKEN;

export default function () {
  const headers = {
    'Content-Type': 'application/json',
    'Authorization': `Bearer ${TOKEN}`,
  };
  const payload = JSON.stringify({
    processDefinitionId: 'order-flow',
    variables: {
      orderId: `O-${__VU}-${__ITER}`,
      amount: Math.random() * 1000,
      customerId: `cust-${Math.floor(Math.random() * 10000)}`,
    },
    awaitCompletion: true,
    timeout: '30s',
  });
  const res = http.post(`${BASE}/api/v1/instances`, payload, {
    headers,
    tags: { name: 'start_instance' },
  });
  check(res, {
    'status 200/201': (r) => r.status === 200 || r.status === 201,
    'completed in body': (r) => JSON.parse(r.body).status === 'COMPLETED',
  });
  // No sleep() here: with an arrival-rate executor, k6 paces the iterations itself.
}
Run:
k6 run --out influxdb=http://influx:8086/k6 \
  -e WF_BASE_URL=https://wf-staging.example.com \
  -e WF_TOKEN="$TOKEN" \
  load/w1-order-flow.js
Synthetic workers
Processes only complete if something works the jobs, so the load test needs workers. We use synthetic workers written in Go with configurable latency:
// loadworkers/main.go
package main

import (
	"context"
	"log"
	"math/rand"
	"os"
	"time"

	"github.com/google/uuid"
	// wfclient: the engine's Go client (import path omitted here)
)

// simulateWork (not shown) sleeps for a lognormally distributed time
// with the given p50 and p99.

func main() {
	client, err := wfclient.New(os.Getenv("WF_URL"))
	if err != nil {
		log.Fatalf("connecting to engine: %v", err)
	}
	defer client.Close()

	// Worker that simulates work
	client.RegisterWorker(wfclient.WorkerOptions{
		Type:        "validate-order",
		Concurrency: 32,
	}, func(ctx context.Context, job *wfclient.Job) (map[string]any, error) {
		// Simulate work: 50ms p50, 200ms p99 (lognormal)
		simulateWork(50*time.Millisecond, 200*time.Millisecond)
		return map[string]any{"valid": true}, nil
	})

	client.RegisterWorker(wfclient.WorkerOptions{
		Type:        "charge-payment",
		Concurrency: 16,
	}, func(ctx context.Context, job *wfclient.Job) (map[string]any, error) {
		simulateWork(200*time.Millisecond, 800*time.Millisecond)
		// 1% error rate (BPMN error)
		if rand.Float64() < 0.01 {
			return nil, wfclient.BPMNError("insufficient-funds", "")
		}
		return map[string]any{"chargeId": uuid.New().String()}, nil
	})

	client.Run(context.Background())
}
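The worker comments describe latency as lognormal with a given p50 and p99. As a minimal sketch of how such a distribution can be parameterized (in Python rather than Go, and with function names that are assumptions, not part of the project), the median fixes `mu` and the p99/p50 ratio fixes `sigma`:

```python
import math
import random
from statistics import NormalDist


def lognormal_params(p50_s: float, p99_s: float) -> tuple[float, float]:
    """Return (mu, sigma) of a lognormal with median p50_s and 99th percentile p99_s.

    For a lognormal, median = exp(mu) and p99 = exp(mu + z99 * sigma),
    where z99 is the 0.99 quantile of the standard normal (about 2.326).
    """
    if not 0 < p50_s < p99_s:
        raise ValueError("expected 0 < p50 < p99")
    z99 = NormalDist().inv_cdf(0.99)
    mu = math.log(p50_s)
    sigma = math.log(p99_s / p50_s) / z99
    return mu, sigma


def sample_latency(p50_s: float, p99_s: float, rng: random.Random) -> float:
    """Draw one simulated service time, in seconds."""
    mu, sigma = lognormal_params(p50_s, p99_s)
    return rng.lognormvariate(mu, sigma)
```

A worker would sleep for `sample_latency(0.050, 0.200, rng)` seconds to mimic the validate-order profile. Note that the mean of a lognormal sits above its median, which matters when sizing worker concurrency.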
pgbench / pg_stat_statements
To identify bottlenecks in Postgres:
-- Top 10 queries by total time
SELECT query, calls, total_exec_time, mean_exec_time, rows
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;

-- Lock contention
SELECT blocked_locks.pid AS blocked_pid,
       blocking_locks.pid AS blocking_pid,
       blocked_activity.query AS blocked_query
FROM pg_catalog.pg_locks blocked_locks
JOIN pg_catalog.pg_stat_activity blocked_activity ON blocked_activity.pid = blocked_locks.pid
JOIN pg_catalog.pg_locks blocking_locks ON ...
perf / pprof (Linux)
For profiling the engine's hot paths:
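# CPU flame graph (-g records call stacks; the stacks must be folded before rendering)
perf record -F 99 -g -p "$(pidof wf-engine)" -- sleep 30
perf script | stackcollapse-perf.pl | flamegraph.pl > flame.svg
# Go pprof (quote the URL so the shell does not interpret "?")
curl -o cpu.prof 'http://wf-engine:6060/debug/pprof/profile?seconds=30'
go tool pprof -http :8081 cpu.prof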
Continuous telemetry (during the load test)
Prometheus + Grafana dashboards. While the load is running, watch:
- Engine: CPU, memory, GC.
- Postgres: locks, cache hit ratio, IOPS, WAL rate.
- App: p50/p95/p99 latencies, error rate, throughput.
Baseline tracking
No PR may regress against the baseline. How:
Continuous benchmarking (CI)
# .github/workflows/perf.yml
on:
  pull_request:
    branches: [main]
jobs:
  perf-regression:
    runs-on: self-hosted-perf  # bare-metal, dedicated
    steps:
      - name: Checkout
        uses: actions/checkout@v4
        with:
          fetch-depth: 0  # full history, so both main and the PR branch can be checked out
      - name: Setup
        run: ./scripts/setup-perf-env.sh
      - name: Run baseline (main)
        run: |
          git checkout main
          ./scripts/run-bench.sh > baseline.json
      - name: Run PR
        env:
          PR_BRANCH: ${{ github.head_ref }}  # source branch of the pull request
        run: |
          git checkout "$PR_BRANCH"
          ./scripts/run-bench.sh > pr.json
      - name: Compare
        run: |
          ./scripts/compare-bench.py baseline.json pr.json \
            --threshold 0.10 --fail-on-regression
Output:
W1 throughput: 100 → 95 instances/sec (-5.0%) ✅ within threshold
W1 latency p99: 820ms → 920ms (+12.2%) ❌ REGRESSION
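The comparison has to be direction-aware: a drop is bad for throughput, while a rise is bad for latency. The following is a minimal sketch of that check, not the actual `compare-bench.py`; the function names and the `higher_is_better` flag are assumptions made for illustration.

```python
def relative_change(baseline: float, candidate: float) -> float:
    """Signed relative change of candidate vs. baseline (0.10 means +10%)."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (candidate - baseline) / baseline


def is_regression(baseline: float, candidate: float, *,
                  higher_is_better: bool, threshold: float = 0.10) -> bool:
    """True when the metric moved in the bad direction by more than threshold."""
    change = relative_change(baseline, candidate)
    if higher_is_better:
        return change < -threshold  # e.g. throughput dropped too far
    return change > threshold       # e.g. latency grew too far


def report_line(name: str, baseline: float, candidate: float, *,
                higher_is_better: bool, threshold: float = 0.10) -> str:
    """Format one comparison line with the signed percentage change."""
    pct = relative_change(baseline, candidate) * 100
    bad = is_regression(baseline, candidate,
                        higher_is_better=higher_is_better, threshold=threshold)
    verdict = "REGRESSION" if bad else "within threshold"
    return f"{name}: {baseline:g} -> {candidate:g} ({pct:+.1f}%) {verdict}"
```

With the numbers from the sample output, `is_regression(100, 95, higher_is_better=True)` is false (a 5% throughput drop) and `is_regression(820, 920, higher_is_better=False)` is true (a 12.2% p99 increase).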
Long-running tracking (Grafana)
Daily perf job in staging:
- Runs W1/W4/W6 sustained for 1h.
- Publishes metrics to Prometheus.
- Dashboard: throughput and latency, day over day.
If something changed without causing a CI failure (e.g., a behavioral change behind a feature flag), it shows up on the dashboard.
Capacity planning
Saturation curves
Find the "knee", the load level at which latency blows up, by stepping up the offered load and recording latency at each step.
Operate at ~70% of the knee to keep a buffer.
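One simple way to locate the knee from measured points is a threshold heuristic: take the latency at the lowest load as the reference and call the knee the last load level whose latency stays within a multiple of it. The sketch below assumes that heuristic and a factor of 2; both are illustrative choices, not a method prescribed here, and the function names are hypothetical.

```python
def find_knee(curve: list[tuple[float, float]], factor: float = 2.0) -> float:
    """Return the highest offered rate before latency exceeds factor x the low-load latency.

    curve holds (offered_rate, p99_latency_ms) pairs, one per load step.
    """
    if not curve:
        raise ValueError("curve must contain at least one point")
    points = sorted(curve)          # ascending by offered rate
    reference = points[0][1]        # latency at the lowest load step
    knee = points[0][0]
    for rate, latency in points:
        if latency > factor * reference:
            break                   # latency has blown up; stop at the previous step
        knee = rate
    return knee


def operating_point(knee_rate: float, headroom: float = 0.70) -> float:
    """Target load as a fraction of the knee (~70% leaves a buffer)."""
    return knee_rate * headroom
```

For a made-up curve such as `[(100, 300), (200, 310), (400, 330), (800, 420), (1000, 900), (1200, 2500)]`, the reference is 300 ms, the knee is the 800 step, and the operating point is about 560.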
Resource model
From the measurements, derive a model:
CPU usage (engine) = 0.5 ms/command × command_rate
Memory (engine) = 200 MB base + 50 KB × active_instances
Postgres CPU = 0.3 ms/command × command_rate × 1.2 (overhead)
Postgres connections = num_engine_nodes × pool_size
Postgres WAL = 5 KB/command × command_rate
Postgres disk growth = 8 KB/command × command_rate (data + audit)
Calculator:
import math

def capacity(target_rps_commands, instance_count_active):
    engine_cpu_cores = target_rps_commands * 0.0005 / 0.7  # 0.5 ms/command at 70% utilization
    engine_memory_mb = 200 + 50 / 1024 * instance_count_active  # 200 MB base + 50 KB per active instance
    pg_cpu_cores = target_rps_commands * 0.0003 * 1.2 / 0.7
    pg_iops_writes = target_rps_commands * 2
    pg_disk_gb_per_day = target_rps_commands * 8e-6 * 86400  # 8 KB/command
    engine_nodes = max(2, math.ceil(engine_cpu_cores / 4))  # 4-core nodes, at least 2
    return {
        'engine_nodes': engine_nodes,
        'engine_memory_per_node_gb': math.ceil(engine_memory_mb / engine_nodes / 1024),
        # postgres_sku: helper (not shown) that maps CPU cores and write IOPS to an instance size
        'postgres_size': postgres_sku(pg_cpu_cores, pg_iops_writes),
        'storage_growth_gb_per_day': pg_disk_gb_per_day,
    }
SLIs vs. metrics noise
SLIs (Service Level Indicators) are the metrics that reflect user experience:
- Process latency p99 (start → complete, happy path) < 1s
- API availability: 2xx/all > 99.9%
- Job activation latency p99 (job created → activated) < 100ms
- Incident creation rate: < 0.1% of jobs
Internal health metrics are for diagnosis, not for SLOs:
- CPU, memory, GC
- Postgres locks, cache hit
- Channel/queue depths
Do not confuse the two: internal metrics can wobble without affecting user experience.
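The latency and availability SLIs can be computed directly from raw samples. This is a minimal sketch using the nearest-rank percentile definition (other definitions interpolate and give slightly different values); the function names are assumptions for illustration.

```python
import math


def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: smallest sample with at least p% of samples <= it."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


def availability(status_codes: list[int]) -> float:
    """Fraction of responses with a 2xx status (the 2xx/all SLI)."""
    if not status_codes:
        raise ValueError("no responses")
    ok = sum(1 for code in status_codes if 200 <= code < 300)
    return ok / len(status_codes)


def meets_slis(latencies_ms: list[float], status_codes: list[int]) -> bool:
    """Check the two API-facing SLIs: process latency p99 < 1s, availability > 99.9%."""
    return percentile(latencies_ms, 99) < 1000 and availability(status_codes) > 0.999
```

Computing p99 from raw samples rather than averaging per-interval percentiles avoids one of the common percentile pitfalls.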
Load test cadence
| Type | When | Duration | Workload |
|---|---|---|---|
| Smoke | Every PR | 5 min | W1 light |
| Regression | Every PR | 15 min | W1, W4 |
| Stress | Nightly | 1h | W1-W6 mix |
| Soak | Weekly | 24h | W6 |
| Spike | Pre-release | 30 min | W1 with a 10× spike |
| Chaos | Monthly | 2h | W6 + chaos-mesh |
Soak test (24h)
Critical for detecting:
- Memory leaks
- File descriptor leaks
- Goroutine leaks
- DB connection leaks
- Slow disk growth (insufficient audit log compaction)
- Eventual GC pressure
- Resource fragmentation
# Soak at a constant baseline load
k6 run --duration 24h --vus 100 load/w6-mixed.js
# Track over the 24h:
# - Memory growing? → leak
# - Latency degrading? → fragmentation
# - GC time growing? → memory pressure
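"Growing" is easier to judge from a fitted trend than from eyeballing a dashboard. As a minimal sketch, a least-squares slope over the soak samples gives growth per hour, which can then be compared against a tolerance; the function names and the tolerance are assumptions for illustration, not project conventions.

```python
def slope_per_hour(samples: list[tuple[float, float]]) -> float:
    """Least-squares slope of value over time.

    samples holds (hours_since_start, value) pairs, e.g. resident memory in MB.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    covariance = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    variance = sum((t - mean_t) ** 2 for t, _ in samples)
    if variance == 0:
        raise ValueError("all samples share the same timestamp")
    return covariance / variance


def looks_like_leak(samples: list[tuple[float, float]],
                    max_growth_per_hour: float) -> bool:
    """True when the fitted growth rate exceeds the tolerated rate."""
    return slope_per_hour(samples) > max_growth_per_hour
```

It is usually worth dropping the first warm-up samples before fitting, since caches filling up look like growth but plateau later. The same function applies to goroutine counts, open file descriptors, or GC time.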
Load test report
# Load test report: W1 sustained 1h, M1 staging
## Configuration
- Engine: 2× 4-vCPU 8GB
- Postgres: db.r6g.xlarge (4 vCPU 32GB)
- Workload: W1 at 200 RPS
## Results
| Metric | Target | Actual | Status |
|---|---|---|---|
| Throughput | 200 inst/s | 198 | ✅ |
| Latency p50 | 200ms | 185ms | ✅ |
| Latency p99 | 600ms | 920ms | ❌ |
| Engine CPU | <70% | 65% | ✅ |
| Postgres CPU | <70% | 82% | ❌ |
| Error rate | <0.1% | 0.05% | ✅ |
## Bottleneck analysis
- Postgres CPU 82%: top query is `INSERT INTO commands ...` (45% time)
- pg_stat_statements shows mean_exec_time 1.2ms (target <1ms)
- WAL flushing limited by single fsync per commit
## Action items
- [ ] Tune Postgres: increase shared_buffers
- [ ] Batch commits in engine (group commit pattern)
- [ ] Profile worker connection pool
Hardware recommendations by phase
See analysis/sizing-benchmarks for details. Summary:
| Phase | Engine | Postgres | Notes |
|---|---|---|---|
| M1 (dev/test) | 1× 2-vCPU 4GB | 1× 2-vCPU 8GB SSD | Single node |
| M2 (small prod) | 2× 4-vCPU 8GB | 2× 4-vCPU 32GB SSD | Patroni HA |
| M3 (mid prod) | 3× 8-vCPU 16GB | 3× 8-vCPU 64GB NVMe | Multi-AZ |
| M4 (large prod) | 6× 16-vCPU 32GB | Citus 3+3 workers, 16-vCPU 128GB | Sharded |
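Anti-patterns in load tests
❌ Instant ramp (cold cache)
Hitting the target load from the first second measures cold caches, not steady state. Ramp up first, then measure:
// GOOD
export const options = {
  stages: [
    { duration: '2m', target: 200 },  // ramp-up
    { duration: '10m', target: 200 }, // sustained measurement
  ],
};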
❌ Undersized workers
The engine can process 1000 jobs/s but the workers only handle 100/s.
Latency is inflated because jobs queue up.
You are measuring the workers' throughput, not the engine's.
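Little's law gives a quick sanity check for worker sizing: the average number of jobs in service equals arrival rate times mean service time. The sketch below applies it with a utilization target so the workers keep headroom; the function name and the 70% default are assumptions for illustration.

```python
import math


def required_worker_concurrency(jobs_per_sec: float,
                                mean_service_time_s: float,
                                target_utilization: float = 0.7) -> int:
    """Worker slots needed so workers are not the bottleneck.

    Little's law: busy slots = arrival rate x mean service time.
    Dividing by the utilization target leaves headroom for bursts.
    """
    if jobs_per_sec <= 0 or mean_service_time_s <= 0:
        raise ValueError("rate and service time must be positive")
    if not 0 < target_utilization <= 1:
        raise ValueError("target_utilization must be in (0, 1]")
    busy_slots = jobs_per_sec * mean_service_time_s
    return math.ceil(busy_slots / target_utilization)
```

For example, 1000 jobs/s at a 50 ms mean service time keeps 50 slots busy on average, so about 72 slots are needed at 70% utilization. Use the mean service time rather than the p50: for skewed latency distributions the mean is higher, and sizing from the median under-provisions.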
❌ Tester far from the engine's network
When the load generator runs outside the engine's network, network latency dominates the measurements and the engine's real behavior is not visible.
❌ No reset between runs
Postgres state accumulates. Every run must start from a known state.
Roadmap
- M1: k6 scripts for W1, manual baseline, Prometheus.
- M2: Continuous benchmarking in CI, nightly soak.
- M3: Capacity calculator integrated into the CLI (wf capacity --target-rps 500).
- M4: Geo-distributed load tests.
References
- analysis/sizing-benchmarks: real-world capacity sizing
- analysis/intuit-production-benchmarks: real-world Intuit case
- analysis/observability-deep-dive: metrics during load
- concepts/microbenchmark-methodology: unit-level micro-benchmarks
- k6 docs
- Latency tip of the iceberg (Gil Tene): percentile pitfalls