Performance testing methodology¶

How to measure engine performance without fooling yourself. Representative synthetic workloads, metrics that matter, tooling (k6, vegeta, pgbench), baseline tracking, capacity planning.

Anti-pattern: the "benchmark show"¶

Many engines publish marketing benchmarks that do not reproduce production conditions:
- "1M jobs/sec!" with empty tasks (no real service tasks).
- p99 < 1ms... measured in a single region, with no GC pressure and no replication.
- Concurrency 1 = single goroutine.

Our discipline: benchmarks that reflect reality.

Reference workloads¶

Workload W1: "Order Flow" (typical)¶

flowchart LR
    S([StartEvent])
    V[ValidateOrder<br/>service task<br/>50ms p50]
    C[ChargePayment<br/>service task<br/>200ms p50, 1% error rate]
    Sh[ShipOrder<br/>service task<br/>30ms p50]
    E([EndEvent])
    S --> V --> C --> Sh --> E

Three service tasks, no complex gateways. It represents 60% of production processes.

Workload W2: "Approval Chain"¶

flowchart LR
    S([StartEvent])
    CR[CreateRequest<br/>service]
    MA[ManagerApproval<br/>user task<br/>p50=2h, very high variance]
    DA[DirectorApproval<br/>user task]
    EA[ExecuteAction<br/>service]
    E([EndEvent])
    BT{boundary timer 24h}
    Esc[escalate]
    S --> CR --> MA --> DA --> EA --> E
    MA -.-> BT -.-> Esc

User tasks + timers. Low throughput, long-lived instances (days to weeks).

Workload W3: "Saga Multi-Step"¶

flowchart TD
    SP[Subprocess 'booking']
    BF[BookFlight - service]
    BH[BookHotel - service]
    BC[BookCar - service]
    OE[On error: compensate<br/>LIFO 3 service tasks]
    SP --> BF
    SP --> BH
    SP --> BC
    SP -.-> OE

Compensation and subprocesses. Stress test for the compensation engine.

Workload W4: "High Fan-Out"¶

flowchart LR
    S([StartEvent])
    PGS{{Parallel Gateway<br/>split into 50 branches}}
    ST[50x ServiceTask<br/>parallel]
    PGM{{Parallel Gateway<br/>merge}}
    E([EndEvent])
    S --> PGS --> ST --> PGM --> E

Stress test for parallel tokens and joins.

Workload W5: "Long Variable Payload"¶

flowchart LR
    S([StartEvent])
    ST1[ServiceTask<br/>variable: JSON 100KB]
    ST2[ServiceTask<br/>variable: JSON 100KB merged]
    E([EndEvent])
    S --> ST1 --> ST2 --> E

Stress test for variable storage (TOAST, JSONB indexes).

Workload W6: "Multi-Tenant Mix"¶

50 tenants, each running a mix of W1 + W2 + W3.
Ratio: 70% W1, 20% W2, 10% W3.

Realistic. Measures fairness, RLS overhead, and partition contention.
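A minimal sketch of how a load generator could draw the W6 mix. The 70/20/10 weights and the 50 tenants come from the text; the tenant naming scheme and the uniform tenant selection are assumptions (a production-like mix may well be skewed toward a few large tenants).

```python
import random
from collections import Counter

# Mix from the text: 70% W1, 20% W2, 10% W3 across 50 tenants.
WORKLOAD_WEIGHTS = {"W1": 0.70, "W2": 0.20, "W3": 0.10}
TENANT_COUNT = 50


def next_request(rng: random.Random) -> tuple[str, str]:
    """Pick a (tenant, workload) pair for the next process instance."""
    # Hypothetical tenant ids; tenants are chosen uniformly (an assumption).
    tenant = f"tenant-{rng.randrange(TENANT_COUNT):02d}"
    workload = rng.choices(
        population=list(WORKLOAD_WEIGHTS),
        weights=list(WORKLOAD_WEIGHTS.values()),
    )[0]
    return tenant, workload


def sample_mix(n: int, seed: int = 42) -> Counter:
    """Draw n requests and count workloads, to sanity-check the ratio."""
    rng = random.Random(seed)
    return Counter(next_request(rng)[1] for _ in range(n))


if __name__ == "__main__":
    total = 100_000
    for name, count in sorted(sample_mix(total).items()):
        print(f"{name}: {count / total:.1%}")
```

Seeding the generator keeps the mix reproducible between runs, which matters when two runs are compared against each other.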

Performance targets (M1-M4)¶

Workload	Metric	M1 target	M2 target	M3 target
W1 single instance	E2E latency p50	300ms	250ms	200ms
W1 single instance	E2E latency p99	1s	800ms	600ms
W1 throughput	instances/sec	100	500	2000
W1 throughput	commands/sec	500	2500	10000
W4 fan-out	join latency	500ms	300ms	200ms
W5 big vars	E2E latency	800ms	600ms	500ms
W6 mixed	error rate	0.01%	0.01%	0.005%
Engine CPU @ target	utilization	< 70%	< 60%	< 50%
Postgres CPU @ target	utilization	< 70%	< 60%	< 50%

Any change that regresses any metric by more than 10% blocks the merge.

Tooling stack¶

k6: primary load testing tool¶

JavaScript scripts, Go runtime. Excelente para HTTP APIs.

// load/w1-order-flow.js
import http from 'k6/http';
import { check, sleep } from 'k6';

export const options = {
    scenarios: {
        sustained: {
            executor: 'constant-arrival-rate',
            rate: 100,            // 100 RPS
            timeUnit: '1s',
            duration: '10m',
            preAllocatedVUs: 50,
            maxVUs: 200,
        },
    },
    thresholds: {
        'http_req_duration{name:start_instance}': ['p(99)<500'],
        'http_req_failed': ['rate<0.01'],
    },
};

const BASE = __ENV.WF_BASE_URL || 'http://localhost:8080';
const TOKEN = __ENV.WF_TOKEN;

export default function () {
    const headers = {
        'Content-Type': 'application/json',
        'Authorization': `Bearer ${TOKEN}`,
    };

    const payload = JSON.stringify({
        processDefinitionId: 'order-flow',
        variables: {
            orderId: `O-${__VU}-${__ITER}`,
            amount: Math.random() * 1000,
            customerId: `cust-${Math.floor(Math.random()*10000)}`,
        },
        awaitCompletion: true,
        timeout: '30s',
    });

    const res = http.post(`${BASE}/api/v1/instances`, payload, {
        headers,
        tags: { name: 'start_instance' },
    });

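    check(res, {
        'status 200/201': (r) => r.status === 200 || r.status === 201,
        // Guard the parse: an error response may not carry a JSON body
        'completed in body': (r) => {
            try {
                return r.json('status') === 'COMPLETED';
            } catch (_) {
                return false;
            }
        },
    });

    // No sleep() here: constant-arrival-rate already paces the iterations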
}

Run:

k6 run --out influxdb=http://influx:8086/k6 \
    -e WF_BASE_URL=https://wf-staging.example.com \
    -e WF_TOKEN=$TOKEN \
    load/w1-order-flow.js

Synthetic workers¶

The processes need workers in order to complete. Synthetic workers in Go with configurable latency:

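// loadworkers/main.go
package main

import (
    "context"
    "log"
    "math/rand"
    "os"
    "time"

    "github.com/google/uuid" // assumed source of uuid.New()
    // plus the project's wfclient package (import path not given here)
)

// simulateWork is not shown in this snippet.

func main() {
    client, err := wfclient.New(os.Getenv("WF_URL"))
    if err != nil {
        log.Fatalf("connect to engine: %v", err)
    }
    defer client.Close()

    // Worker that simulates work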
    client.RegisterWorker(wfclient.WorkerOptions{
        Type:        "validate-order",
        Concurrency: 32,
    }, func(ctx context.Context, job *wfclient.Job) (map[string]any, error) {
        // Simulate work: 50ms p50, 200ms p99 (lognormal)
        simulateWork(50*time.Millisecond, 200*time.Millisecond)
        return map[string]any{"valid": true}, nil
    })

    client.RegisterWorker(wfclient.WorkerOptions{
        Type:        "charge-payment",
        Concurrency: 16,
    }, func(ctx context.Context, job *wfclient.Job) (map[string]any, error) {
        simulateWork(200*time.Millisecond, 800*time.Millisecond)
        // 1% error rate (BPMN error)
        if rand.Float64() < 0.01 {
            return nil, wfclient.BPMNError("insufficient-funds", "")
        }
        return map[string]any{"chargeId": uuid.New().String()}, nil
    })

    client.Run(context.Background())
}
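The worker comments describe latencies as "50ms p50, 200ms p99 (lognormal)". One way to turn such a pair into a sampler is to solve for the lognormal parameters: the median fixes `mu`, and the p99 fixes `sigma` through the standard normal 99th-percentile z-score. The sketch below is in Python for brevity and is only an illustration of the math; it is not the actual `simulateWork` helper, whose implementation is not shown.

```python
import math
import random
from statistics import NormalDist

# z-score of the 99th percentile of the standard normal (about 2.326)
Z99 = NormalDist().inv_cdf(0.99)


def lognormal_params(p50_s: float, p99_s: float) -> tuple[float, float]:
    """Return (mu, sigma) of the lognormal with the given median and p99."""
    mu = math.log(p50_s)                    # median of a lognormal is exp(mu)
    sigma = math.log(p99_s / p50_s) / Z99   # p99 = exp(mu + Z99 * sigma)
    return mu, sigma


def sample_latency(rng: random.Random, p50_s: float, p99_s: float) -> float:
    """Draw one simulated service time in seconds."""
    mu, sigma = lognormal_params(p50_s, p99_s)
    return rng.lognormvariate(mu, sigma)


if __name__ == "__main__":
    rng = random.Random(7)
    draws = sorted(sample_latency(rng, 0.050, 0.200) for _ in range(100_000))
    print(f"p50 = {draws[len(draws) // 2] * 1000:.1f} ms")
    print(f"p99 = {draws[int(len(draws) * 0.99)] * 1000:.1f} ms")
```

A lognormal has a long right tail, so its mean sits above the median; worker capacity should be sized from the mean, not the p50.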

pgbench / pg_stat_statements¶

To identify bottlenecks in Postgres:

-- Top 10 queries by total time
SELECT query, calls, total_exec_time, mean_exec_time, rows
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;

-- Lock contention
SELECT blocked_locks.pid AS blocked_pid,
       blocking_locks.pid AS blocking_pid,
       blocked_activity.query AS blocked_query
FROM pg_catalog.pg_locks blocked_locks
JOIN pg_catalog.pg_stat_activity blocked_activity ON blocked_activity.pid = blocked_locks.pid
JOIN pg_catalog.pg_locks blocking_locks ON ...

perf / pprof (Linux)¶

For profiling the engine's hot paths:

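# CPU flame graph (-g records call stacks; stackcollapse-perf.pl and
# flamegraph.pl come from Brendan Gregg's FlameGraph repository)
perf record -F 99 -g -p $(pidof wf-engine) -- sleep 30
perf script | stackcollapse-perf.pl | flamegraph.pl > flame.svg

# Go pprof (URL quoted so the shell does not expand the '?')
curl -o cpu.prof 'http://wf-engine:6060/debug/pprof/profile?seconds=30'
go tool pprof -http :8081 cpu.prof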

Continuous telemetry (during the load test)¶

Prometheus + Grafana dashboards. While the load test runs, watch:
- Engine: CPU, memory, GC.
- Postgres: locks, cache hit ratio, IOPS, WAL rate.
- App: p50/p95/p99 latencies, error rate, throughput.

Baseline tracking¶

No PR may regress against the baseline. How:

Continuous benchmarking (CI)¶

# .github/workflows/perf.yml
on:
  pull_request:
    branches: [main]

jobs:
  perf-regression:
    runs-on: self-hosted-perf  # bare-metal, dedicated
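    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0  # full history, so both main and the PR branch can be checked out

      - name: Setup
        run: ./scripts/setup-perf-env.sh

      - name: Run baseline (main)
        run: |
          git checkout main
          ./scripts/run-bench.sh > baseline.json

      - name: Run PR
        env:
          PR_BRANCH: ${{ github.head_ref }}  # source branch (same-repository PRs)
        run: |
          git checkout "$PR_BRANCH"
          ./scripts/run-bench.sh > pr.json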

      - name: Compare
        run: |
          ./scripts/compare-bench.py baseline.json pr.json \
              --threshold 0.10 --fail-on-regression

Output:

W1 throughput:        100 → 95 instances/sec  (-5.0%)  ✅ within threshold
W1 latency p99:       820ms → 920ms          (+12.2%) ❌ REGRESSION
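The comparison step has to know which direction is "worse" for each metric: throughput regresses when it drops, latency when it rises. The sketch below shows that logic under stated assumptions. The metric names and the flat `{metric: value}` input shape are hypothetical, and this is not the actual `compare-bench.py`.

```python
from dataclasses import dataclass

# Metrics where a higher value is better; everything else is lower-is-better.
# These names are hypothetical.
HIGHER_IS_BETTER = {"w1_throughput"}


@dataclass
class Comparison:
    metric: str
    baseline: float
    candidate: float
    change: float      # signed relative change, e.g. -0.05 for -5%
    regressed: bool


def compare(baseline: dict, candidate: dict, threshold: float = 0.10) -> list[Comparison]:
    """Flag every metric that moved in the bad direction by more than threshold."""
    results = []
    for metric, base in baseline.items():
        cand = candidate[metric]
        change = (cand - base) / base
        worse_by = -change if metric in HIGHER_IS_BETTER else change
        results.append(Comparison(metric, base, cand, change, worse_by > threshold))
    return results


if __name__ == "__main__":
    report = compare(
        {"w1_throughput": 100, "w1_latency_p99_ms": 820},
        {"w1_throughput": 95, "w1_latency_p99_ms": 920},
    )
    for c in report:
        status = "REGRESSION" if c.regressed else "within threshold"
        print(f"{c.metric}: {c.baseline} -> {c.candidate} ({c.change:+.1%}) {status}")
    raise SystemExit(1 if any(c.regressed for c in report) else 0)
```

With the numbers from the sample output, the throughput drop of 5% stays within the 10% threshold while the 12.2% p99 increase fails the job.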

Long-running tracking (Grafana)¶

Daily perf job in staging:
- Runs W1/W4/W6 sustained for 1h.
- Publishes metrics to Prometheus.
- Dashboard: throughput and latency day over day.

If something changed without causing a CI failure (e.g., a behavioral change behind a feature flag), it shows up on the dashboard.

Capacity planning¶

Saturation curves¶

Find the "knee" where latency blows up:

RPS:   50   100  200  500  1000  1500  2000
p99:   200  220  240  280  350   480   2500  ← knee here

Operate at ~70% of the knee to keep headroom.
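A rough way to locate the knee programmatically is to look for the first load level where p99 jumps by a large factor relative to the previous level. The sketch below uses the sample curve from the text; the 2× jump factor is an arbitrary assumption, and a real curve with noisier data may need smoothing or more load levels around the jump.

```python
# Saturation curve from the table above (p99 in ms).
RPS = [50, 100, 200, 500, 1000, 1500, 2000]
P99_MS = [200, 220, 240, 280, 350, 480, 2500]


def find_knee(rps, p99, jump_factor=2.0):
    """Return the first load level whose p99 exceeds jump_factor times the
    previous level's p99, or None if the curve never jumps."""
    for i in range(1, len(rps)):
        if p99[i] > jump_factor * p99[i - 1]:
            return rps[i]
    return None


def operating_point(knee_rps, headroom=0.70):
    """Target load as a fraction of the knee (70% per the guideline above)."""
    return knee_rps * headroom


if __name__ == "__main__":
    knee = find_knee(RPS, P99_MS)
    print(f"knee at {knee} RPS, operate at about {operating_point(knee):.0f} RPS")
```

For this curve the jump happens at 2000 RPS, which puts the operating point at roughly 1400 RPS.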

Resource model¶

Derive a model from the measurements:

CPU usage (engine) = 0.5 ms/command × command_rate
Memory (engine) = 200 MB base + 50 KB × active_instances
Postgres CPU = 0.3 ms/command × command_rate × 1.2 (overhead)
Postgres connections = num_engine_nodes × pool_size
Postgres WAL = 5 KB/command × command_rate
Postgres disk growth = 8 KB/command × command_rate (data + audit)

Calculator:

import math

def capacity(target_rps_commands, instance_count_active):
    engine_cpu_cores = target_rps_commands * 0.0005 / 0.7  # 70% utilization
    engine_memory_mb = 200 + 50/1024 * instance_count_active

    pg_cpu_cores = target_rps_commands * 0.0003 * 1.2 / 0.7
    pg_iops_writes = target_rps_commands * 2
    pg_disk_gb_per_day = target_rps_commands * 8e-6 * 86400

    # Computed first because the per-node memory below depends on it
    engine_nodes = max(2, math.ceil(engine_cpu_cores / 4))  # 4-core nodes, at least 2

    return {
        'engine_nodes': engine_nodes,
        'engine_memory_per_node_gb': math.ceil(engine_memory_mb / engine_nodes / 1024),
        # postgres_sku() maps CPU and IOPS needs to an instance size; not defined here
        'postgres_size': postgres_sku(pg_cpu_cores, pg_iops_writes),
        'storage_growth_gb_per_day': pg_disk_gb_per_day,
    }

SLI metrics vs metrics noise¶

SLIs (Service Level Indicators), the metrics that reflect UX:

Process latency p99 (start → complete, happy path) < 1s
API availability: 2xx/all > 99.9%
Job activation latency p99 (job created → activated) < 100ms
Incident creation rate: < 0.1% of jobs

Internal health metrics, for diagnosis, not SLOs:

CPU, memory, GC
Postgres locks, cache hit
Channel/queue depths

Do not confuse the two: internal metrics can wobble without affecting UX.
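As an illustration, the four SLIs listed above can be evaluated from raw samples with a few lines of code. The thresholds come from the list; the function names, the input shape, and the choice of a nearest-rank percentile are assumptions of this sketch.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct percent
    of all samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def evaluate_slis(process_latencies_ms, activation_latencies_ms,
                  http_2xx, http_total, incidents, jobs):
    """Return {sli_name: (value, ok)} using the thresholds listed above."""
    p99_process = percentile(process_latencies_ms, 99)
    p99_activation = percentile(activation_latencies_ms, 99)
    availability = http_2xx / http_total
    incident_rate = incidents / jobs
    return {
        "process_latency_p99_ms": (p99_process, p99_process < 1000),
        "api_availability": (availability, availability > 0.999),
        "job_activation_p99_ms": (p99_activation, p99_activation < 100),
        "incident_rate": (incident_rate, incident_rate < 0.001),
    }


if __name__ == "__main__":
    slis = evaluate_slis(
        process_latencies_ms=[300] * 98 + [900, 1500],
        activation_latencies_ms=[20] * 99 + [250],
        http_2xx=99_950, http_total=100_000,
        incidents=5, jobs=100_000,
    )
    for name, (value, ok) in slis.items():
        print(f"{name}: {value} {'OK' if ok else 'VIOLATED'}")
```

Computing percentiles from raw samples avoids averaging percentiles across nodes or time windows, which does not yield a meaningful p99.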

Load test cadence¶

Type	When	Duration	Workload
Smoke	Every PR	5 min	W1 light
Regression	Every PR	15 min	W1, W4
Stress	Nightly	1h	W1-W6 mix
Soak	Weekly	24h	W6
Spike	Pre-release	30 min	W1 with 10× spike
Chaos	Monthly	2h	W6 + chaos-mesh

Soak test (24h)¶

Critical for detecting:
- Memory leaks
- File descriptor leaks
- Goroutine leaks
- DB connection leaks
- Slow disk growth (insufficient audit log compaction)
- Eventual GC pressure
- Resource fragmentation

# Soak at a constant baseline load
k6 run --duration 24h --vus 100 load/w6-mixed.js

# Track over the 24h:
# - Memory growing? → leak
# - Latency degrading? → fragmentation
# - GC time growing? → memory pressure

Load test report¶

# Load test report: W1 sustained 1h, M1 staging

## Configuration
- Engine: 2× 4-vCPU 8GB
- Postgres: db.r6g.xlarge (4 vCPU 32GB)
- Workload: W1 at 200 RPS

## Results
| Metric | Target | Actual | Status |
|---|---|---|---|
| Throughput | 200 inst/s | 198 | ✅ |
| Latency p50 | 200ms | 185ms | ✅ |
| Latency p99 | 600ms | 920ms | ❌ |
| Engine CPU | <70% | 65% | ✅ |
| Postgres CPU | <70% | 82% | ❌ |
| Error rate | <0.1% | 0.05% | ✅ |

## Bottleneck analysis
- Postgres CPU 82%: top query is `INSERT INTO commands ...` (45% time)
- pg_stat_statements shows mean_exec_time 1.2ms (target <1ms)
- WAL flushing limited by single fsync per commit

## Action items
- [ ] Tune Postgres: increase shared_buffers
- [ ] Batch commits in engine (group commit pattern)
- [ ] Profile worker connection pool

Hardware recommendations by phase¶

See sizing benchmarks for details. Summary:

Phase	Engine	Postgres	Notes
M1 (dev/test)	1× 2-vCPU 4GB	1× 2-vCPU 8GB SSD	Single node
M2 (small prod)	2× 4-vCPU 8GB	2× 4-vCPU 32GB SSD	Patroni HA
M3 (mid prod)	3× 8-vCPU 16GB	3× 8-vCPU 64GB NVMe	Multi-AZ
M4 (large prod)	6× 16-vCPU 32GB	Citus 3+3 workers, 16-vCPU 128GB	Sharded

Anti-patterns in load tests¶

❌ Instant ramp (cold cache)¶

// BAD
options = { vus: 200, duration: '10m' }
// First minute: cold cache, inflated latency

// GOOD
options = {
    stages: [
        { duration: '2m', target: 200 },  // ramp-up
        { duration: '10m', target: 200 }, // sustained measurement
    ],
}
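A ramp-up stage warms the caches, but the samples recorded during that stage still end up in the raw results unless they are filtered out. When percentiles are computed offline from exported samples, one option is to drop the warm-up window first. The sketch below assumes samples are `(timestamp_s, latency_ms)` pairs; that shape and the function name are hypothetical.

```python
def steady_state(samples, warmup_s):
    """Keep only the (timestamp_s, latency_ms) samples recorded after the
    warm-up window, measured from the first sample."""
    if not samples:
        return []
    start = min(t for t, _ in samples)
    return [(t, v) for t, v in samples if t - start >= warmup_s]


if __name__ == "__main__":
    # 2 minutes of slow, cold-cache samples followed by steady traffic.
    samples = [(t, 900 if t < 120 else 250) for t in range(0, 720, 10)]
    kept = steady_state(samples, warmup_s=120)
    print(f"kept {len(kept)} of {len(samples)} samples, "
          f"max latency {max(v for _, v in kept)} ms")
```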

❌ Workers undersized¶

The engine processes 1000 jobs/s but the workers only 100/s.
Latency is inflated because jobs queue up.
You are measuring worker throughput, not engine throughput.
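Little's law gives a quick check for whether the workers can keep up: the average number of busy worker slots equals the job arrival rate times the mean service time. The sketch below sizes concurrency with a utilization target for headroom. The 70% target is an assumption borrowed from the CPU targets, and the service time must be the mean, which for a long-tailed distribution is higher than the p50.

```python
import math


def required_concurrency(jobs_per_sec, mean_service_time_s, target_utilization=0.7):
    """Worker slots needed so that the average utilization stays at the target.
    Little's law: busy slots = arrival rate x mean service time."""
    busy_slots = jobs_per_sec * mean_service_time_s
    return math.ceil(busy_slots / target_utilization)


def sustainable_rate(concurrency, mean_service_time_s):
    """Upper bound on jobs/s that a pool of this size can complete."""
    return concurrency / mean_service_time_s


if __name__ == "__main__":
    # 1000 jobs/s at a 50 ms mean service time
    print(required_concurrency(1000, 0.050))
    # A pool of 32 slots at 50 ms per job tops out near 640 jobs/s
    print(round(sustainable_rate(32, 0.050)))
```

If the sustainable rate of the worker pool is below the rate the load test drives, the measured latency reflects queueing at the workers and not the engine.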

❌ Tester on the same network as the engine¶

Network latency dominates; the engine's true behavior is not visible.

❌ No reset between runs¶

Postgres state accumulates. Every run must start from a known state.

Roadmap¶

M1: k6 scripts W1, baseline manual, Prometheus.
M2: CI continuous benchmarking, soak nightly.
M3: Capacity calculator integrated into the CLI (wf capacity --target-rps 500).
M4: Geo-distributed load tests.

References¶

sizing benchmarks: real-world capacity sizing
intuit production benchmarks: real-world case from Intuit
observability deep dive: metrics during load
microbenchmark methodology: unit-level micro-benchmarks
k6 docs
Latency tip of the iceberg (Gil Tene): percentile pitfalls