Observability Deep Dive

Observability spec for the MVP. 50+ business-specific metrics with standardized labels. Structured logging with sampling and redaction. Distributed traces with span design. Alert design: golden signals + business KPIs + saturation. SLI/SLO definitions: 99% availability, 99% commands < 1s, 99.9% no data loss. Error budget calculation. Runbook automation. RED and USE methods applied. Costs: $1K-10K/month for APM depending on scale.

Observability vs monitoring

Monitoring: predetermined questions (CPU, memory, request count)

Observability: ask new questions (why is THIS user's request slow?)

MVP needs both. Strategy: OpenTelemetry as instrumentation + APM as backend (ADR-010, ADR-011).

The three pillars

Metrics — numerical aggregates

Best for: dashboards, alerts, trends, capacity planning. Volume: low (per second aggregations). Cost: cheap to store, fast to query.

Logs — structured events

Best for: forensics, debugging specific issues, audit. Volume: medium to high. Cost: medium ($0.50-2/GB depending on retention).

Traces — request flow

Best for: debugging latency, understanding flow. Volume: high (sampled). Cost: medium (sampling reduces).

Use all three. Each answers different questions.

Metrics catalog

Per adrs/adr-011-opentelemetry-instrumentation, all metrics are emitted via OTel.

Business metrics (workflow-specific)

| Metric | Type | Labels | Purpose |
| --- | --- | --- | --- |
| workflow.process_instances.created.total | counter | tenant, bpmn_process_id | Throughput |
| workflow.process_instances.completed.total | counter | tenant, bpmn_process_id, outcome | Success rate |
| workflow.process_instances.canceled.total | counter | tenant, bpmn_process_id, reason | Cancellation tracking |
| workflow.process_instances.duration | histogram | tenant, bpmn_process_id | End-to-end time |
| workflow.process_instances.active | gauge | tenant, bpmn_process_id | In-flight count |
| workflow.process_instances.event_count | histogram | tenant | Detect runaway processes |
| workflow.jobs.activated.total | counter | tenant, job_type | Worker activity |
| workflow.jobs.completed.total | counter | tenant, job_type | Worker success |
| workflow.jobs.failed.total | counter | tenant, job_type, error_type | Worker errors |
| workflow.jobs.duration | histogram | tenant, job_type | Worker latency |
| workflow.jobs.activation.queue_depth | gauge | job_type | Job backlog |
| workflow.jobs.timeout.total | counter | tenant, job_type | Worker hangs |
| workflow.incidents.created.total | counter | tenant, error_type, bpmn_process_id | Incident rate |
| workflow.incidents.resolved.total | counter | tenant, error_type | Resolution rate |
| workflow.incidents.active | gauge | tenant | Incident backlog |
| workflow.user_tasks.created.total | counter | tenant, bpmn_process_id | Task creation |
| workflow.user_tasks.completed.total | counter | tenant, bpmn_process_id, assignee_pattern | Task completion |
| workflow.user_tasks.duration | histogram | tenant | How long tasks pending |
| workflow.user_tasks.active | gauge | tenant | Pending tasks |
| workflow.timers.scheduled.total | counter | tenant | Timer activity |
| workflow.timers.fired.total | counter | tenant | Timer fires |
| workflow.messages.published.total | counter | tenant, message_name | Message activity |
| workflow.messages.correlated.total | counter | tenant, message_name, outcome | Correlation success |
| workflow.deployments.created.total | counter | tenant | Deploy frequency |

Engine internal metrics

| Metric | Type | Labels | Purpose |
| --- | --- | --- | --- |
| engine.commands.processed.total | counter | intent, tenant | Engine throughput |
| engine.commands.processing.duration | histogram | intent | Engine latency |
| engine.commands.queue.depth | gauge | - | Queue saturation |
| engine.commands.queue.depth.percent | gauge | - | Saturation 0-100% |
| engine.commands.rejected.total | counter | reason | Backpressure |
| engine.replay.position | gauge | - | Catch-up tracking |
| engine.cache.process_definitions.hit_rate | gauge | - | Cache effectiveness |
| engine.cache.size | gauge | cache_type | Memory tracking |

System metrics (standard)

| Metric | Type | Source |
| --- | --- | --- |
| process.cpu.utilization | gauge | Process |
| process.memory.usage | gauge | Process |
| process.threads.count | gauge | Process |
| go.gc.duration (or JVM equiv) | histogram | Runtime |
| go.goroutines.count | gauge | Runtime |
| http.server.duration | histogram | HTTP server |
| http.server.requests.total | counter | HTTP server |
| db.connections.active | gauge | DB pool |
| db.connections.idle | gauge | DB pool |
| db.connections.waiting | gauge | DB pool |
| db.query.duration | histogram | DB queries |

PostgreSQL metrics (via postgres_exporter)

| Metric | Purpose |
| --- | --- |
| pg_stat_database_* | DB-level stats |
| pg_replication_lag_bytes | Replication health |
| pg_stat_statements_* | Top slow queries |
| pg_stat_user_tables_* | Table activity |
| pg_stat_bgwriter_* | Background writer |
| pg_locks_* | Lock contention |
| pg_settings_* | Configuration |

Cardinality management

Combinatorial explosion of labels = metric storage explosion. Rules:

High cardinality (AVOID in labels)

  • process_instance_key — millions of values
  • job_key — millions
  • user_id — thousands+
  • business_id — user-controlled

These belong in logs/traces, NOT metrics.

Acceptable cardinality (OK in labels)

  • tenant_id — typically < 10,000
  • bpmn_process_id — typically < 100 per tenant
  • job_type — typically < 100 per tenant
  • error_type — < 20

Calculate before adding label:

distinct_label_values × other_label_values × number_of_metrics
should be < ~1M unique series
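
The budget check above can be sketched in a few lines. This is only an illustration: the function name `estimate_series` and the example label counts are assumptions, and the product is a worst-case bound (real series counts are usually lower because not every label combination occurs).

```python
from math import prod

SERIES_BUDGET = 1_000_000  # approximate ceiling from the rule above


def estimate_series(label_cardinalities: dict[str, int], metric_count: int = 1) -> int:
    """Worst-case series count: product of distinct values per label, times metrics."""
    return prod(label_cardinalities.values()) * metric_count


# Hypothetical example: 50 tenants, 20 job types, 10 error types, 3 metrics.
labels = {"tenant": 50, "job_type": 20, "error_type": 10}
total = estimate_series(labels, metric_count=3)
print(total, "OK" if total < SERIES_BUDGET else "over budget")
```

Running the same check with a `process_instance_key` label of a few million values exceeds the budget on a single metric, which is why such identifiers go to logs and traces instead.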

Logs strategy

Structured logging mandatory

NO printf style. JSON always:

{
  "timestamp": "2025-05-14T12:34:56.789Z",
  "level": "INFO",
  "logger": "engine.processor",
  "trace_id": "abc123def456...",
  "span_id": "78901234abcd...",
  "tenant_id": "acme",
  "process_instance_key": 12345,
  "intent": "ELEMENT_COMPLETED",
  "element_id": "task1",
  "duration_ms": 23,
  "message": "Element completed successfully"
}

Required fields:

  • timestamp (ISO 8601 UTC)
  • level (DEBUG, INFO, WARN, ERROR, FATAL)
  • logger (component name)
  • message (human-readable)
  • trace_id (correlation with distributed trace)

Optional but recommended:

  • span_id
  • tenant_id
  • Resource IDs (when applicable)
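
One way to produce this shape with Python's standard `logging` module is a custom formatter. The sketch below is an assumption about how it could be wired, not the project's actual logger: the class name `JsonFormatter`, the `CONTEXT_KEYS` tuple, and the level-name mapping are all illustrative.

```python
import json
import logging
from datetime import datetime, timezone

# stdlib uses WARNING/CRITICAL; map them to the level names listed above.
LEVEL_NAMES = {"WARNING": "WARN", "CRITICAL": "FATAL"}


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object with the required fields."""

    # Optional context keys copied from the record when present (set via `extra=`).
    CONTEXT_KEYS = ("trace_id", "span_id", "tenant_id", "process_instance_key")

    def format(self, record: logging.LogRecord) -> str:
        ts = datetime.fromtimestamp(record.created, tz=timezone.utc)
        entry = {
            "timestamp": ts.isoformat(timespec="milliseconds").replace("+00:00", "Z"),
            "level": LEVEL_NAMES.get(record.levelname, record.levelname),
            "logger": record.name,
            "message": record.getMessage(),
        }
        for key in self.CONTEXT_KEYS:
            if hasattr(record, key):
                entry[key] = getattr(record, key)
        return json.dumps(entry)


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("engine.processor")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("Element completed successfully",
            extra={"trace_id": "abc123", "tenant_id": "acme"})
```

Keeping the context keys in an explicit allow-list means an unexpected `extra` field never reaches the output by accident.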

Log levels

| Level | Use for | Volume |
| --- | --- | --- |
| DEBUG | Detailed flow, dev-only | Off in prod |
| INFO | Routine operations | Sampled in prod (1%) |
| WARN | Recoverable issues | All in prod |
| ERROR | Failures, incidents | All in prod |
| FATAL | About to die | All in prod |

Sampling strategy

For high-volume INFO logs, sample:

import logging
import random

logger = logging.getLogger("engine.processor")

def log_info_sampled(msg, sample_rate=0.01, important=False, **fields):
    # Always emit when flagged important; otherwise emit a random sample.
    if important or random.random() < sample_rate:
        logger.info(msg, extra=fields)

# For critical paths: always log
log_info_sampled("Process completed", important=True, tenant_id="acme")

# For routine: sample
log_info_sampled("Variable read", sample_rate=0.001, tenant_id="acme")

Always-log: ERROR, WARN, security events, audit-relevant.

Redaction

Per analysis/security-threat-model T4.3:

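import re

# Keys whose values must never reach the log output (case-insensitive substring match).
SENSITIVE_KEYS = re.compile(r'password|token|secret|api_?key|ssn|credit', re.I)

def redact(obj):
    # Walk dicts and lists recursively; replace values under sensitive keys.
    if isinstance(obj, dict):
        return {k: '[REDACTED]' if SENSITIVE_KEYS.search(k) else redact(v)
                for k, v in obj.items()}
    if isinstance(obj, list):
        return [redact(v) for v in obj]
    return obj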

Applied automatically before logging.
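
"Automatically" could be implemented as a `logging.Filter` that rewrites each record before any handler sees it. This is a sketch under assumptions: the filter name `RedactionFilter` and the convention of attaching the structured payload as `record.fields` are hypothetical, and `redact` is repeated here only so the block runs on its own.

```python
import logging
import re

SENSITIVE_KEYS = re.compile(r'password|token|secret|api_?key|ssn|credit', re.I)


def redact(obj):
    # Same logic as the function above, repeated so this block is self-contained.
    if isinstance(obj, dict):
        return {k: '[REDACTED]' if SENSITIVE_KEYS.search(k) else redact(v)
                for k, v in obj.items()}
    if isinstance(obj, list):
        return [redact(v) for v in obj]
    return obj


class RedactionFilter(logging.Filter):
    """Redact a structured payload attached to the record as `record.fields`."""

    def filter(self, record: logging.LogRecord) -> bool:
        fields = getattr(record, "fields", None)
        if isinstance(fields, dict):
            record.fields = redact(fields)
        return True  # never drop the record, only rewrite it


logger = logging.getLogger("engine.api")
logger.addFilter(RedactionFilter())
logger.warning("login failed", extra={"fields": {"user": "bob", "password": "hunter2"}})
```

A filter attached to a logger only sees records logged directly on that logger; attaching it to the handler instead also covers records propagated from child loggers.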

Log retention

| Log type | Hot retention | Cold retention |
| --- | --- | --- |
| ERROR/FATAL | 30 days | 1 year |
| WARN | 14 days | 90 days |
| INFO sampled | 7 days | 30 days |
| DEBUG | Off in prod | - |
| Audit log | 30 days | 7 years (compliance) |

Storage backends:

  • Hot: Loki/Elasticsearch/Splunk
  • Cold: S3 with Athena queries

Distributed tracing

Span design

For each "operation" create a span:

from opentelemetry import trace

tracer = trace.get_tracer(__name__)

def process_command(tenant_id, intent, bpmn_process_id):
    with tracer.start_as_current_span("process_command") as span:
        span.set_attribute("workflow.tenant.id", tenant_id)
        span.set_attribute("workflow.command.intent", intent)
        span.set_attribute("workflow.process.id", bpmn_process_id)

        # Process...

        span.set_attribute("workflow.command.success", True)

Span hierarchy

HTTP request (parent span)
├── auth.validate_token
├── api.handler.create_instance
│   ├── db.command_log.insert
│   └── notify.engine
│       └── engine.process_command (could be separate trace if async)
│           ├── db.read_state
│           ├── bpmn.parse (cached usually)
│           ├── bpmn.process
│           │   ├── behavior.variable_mapping
│           │   ├── behavior.job_creation
│           │   └── ...
│           └── db.write_state_atomic

Workers continue trace:

worker.activate_jobs
├── http.request to engine
└── http.response (with trace context propagated back)

worker.handler.send_email
├── (continuation of process_command trace via propagated trace context)
├── http.request to email provider
└── worker.complete_job
    └── http.request to engine

End-to-end visibility from user request → worker → completion.

Sampling

100% sampling = expensive. Strategies:

import random

PREMIUM_TENANTS = set()  # placeholder: fill with VIP tenant IDs

# Tail-based sampling: decide after the trace completes, keeping every slow, failed, or VIP trace
def should_sample(trace):
    return (
        trace.duration_ms > 1000 or          # Slow
        trace.has_error or                    # Failed
        trace.tenant in PREMIUM_TENANTS or    # VIP
        random.random() < 0.01                # 1% baseline
    )

Most APMs handle this server-side.

Trace context propagation

W3C Trace Context standard:

GET /v2/jobs/12345 HTTP/1.1
traceparent: 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01
tracestate: vendor1=...

OTel SDK handles automatically.
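
For debugging propagation issues it helps to know what the header contains. The sketch below parses the version 00 `traceparent` layout (version, trace ID, parent span ID, flags); the names `TraceParent` and `parse_traceparent` are illustrative and not part of any SDK, and later header versions with extra fields are deliberately rejected here.

```python
import re
from typing import NamedTuple, Optional

# version-traceid-parentid-flags, all lowercase hex (W3C Trace Context, version 00)
_TRACEPARENT = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


class TraceParent(NamedTuple):
    version: str
    trace_id: str
    parent_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[TraceParent]:
    """Return the parsed header, or None if it is malformed."""
    m = _TRACEPARENT.match(header.strip())
    if m is None:
        return None
    version, trace_id, parent_id, flags = m.groups()
    # Version ff and all-zero IDs are invalid per the spec.
    if version == "ff" or set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None
    # Bit 0 of the flags byte is the "sampled" flag.
    return TraceParent(version, trace_id, parent_id, bool(int(flags, 16) & 0x01))


tp = parse_traceparent("00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01")
print(tp)
```

In the example header above, the trailing `01` means the upstream service sampled the trace, so downstream spans should be recorded as well.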

Span attributes conventions

Per OTel semantic conventions:

  • HTTP: http.method, http.url, http.status_code (named http.request.method, url.full, http.response.status_code in the current stable conventions)
  • DB: db.system, db.statement, db.user
  • Workflow custom: workflow.tenant.id, workflow.process.id, etc.

Document workflow.* conventions explicitly for consistency.

Alerting design

Golden signals (SRE book)

Always alert on these:

  1. Latency: TP99 > threshold for N min
  2. Traffic: TPS unusual (anomaly)
  3. Errors: Error rate > threshold
  4. Saturation: Resource utilization high

For MVP engine:

# Latency (p99 over 5m; threshold assumes the histogram is recorded in seconds)
- alert: HighRequestLatency
  expr: histogram_quantile(0.99, sum by (le) (rate(http_server_duration_bucket[5m]))) > 1.0
  for: 5m
  severity: warning

# Traffic anomaly
- alert: TrafficDropSudden
  expr: (rate(http_server_requests_total[5m])) < (rate(http_server_requests_total[1h] offset 1h)) * 0.5
  for: 10m
  severity: warning

# Error rate (share of 5xx responses above 5%)
- alert: HighErrorRate
  expr: sum(rate(http_server_requests_total{status=~"5.."}[5m])) / sum(rate(http_server_requests_total[5m])) > 0.05
  for: 5m
  severity: critical

# Saturation
- alert: EngineQueueSaturating
  expr: engine_commands_queue_depth_percent > 80
  for: 2m
  severity: warning

- alert: EngineQueueCritical
  expr: engine_commands_queue_depth_percent > 95
  for: 1m
  severity: critical

Business KPI alerts

- alert: ProcessInstanceCreationDropped
  expr: rate(workflow_process_instances_created_total[15m]) < 1
  for: 30m
  severity: warning
  annotations:
    summary: "Process creation rate unusual"

- alert: IncidentRateHigh
  expr: rate(workflow_incidents_created_total[5m]) > 1
  for: 5m
  severity: warning

Failure-mode alerts

Per analysis/failure-mode-analysis, specific alerts:

# F1: Engine crash
- alert: EngineDown
  expr: up{job="workflow-engine"} == 0
  for: 1m
  severity: critical

# F2: Processing stuck
- alert: EngineProcessingStuck
  expr: rate(engine_commands_processed_total[5m]) == 0
  for: 5m
  severity: critical

# F5: DB primary failed
- alert: PostgresPrimaryDown
  expr: pg_up{role="primary"} == 0
  for: 30s
  severity: critical

# F6: Replication lag
- alert: PostgresReplicationLag
  expr: pg_replication_lag_bytes > 100e6
  for: 5m
  severity: warning

# F7: Disk full
- alert: DiskFullCritical
  expr: (node_filesystem_avail_bytes / node_filesystem_size_bytes) < 0.1
  for: 5m
  severity: critical

Alert routing

# Routes
routes:
  - match: { severity: critical }
    receiver: pagerduty
    continue: true
  - match: { severity: critical }
    receiver: slack-critical
  - match: { severity: warning }
    receiver: slack-warnings
  - match: { team: db }
    receiver: dba-team
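
The effect of `continue` is easier to see in a small simulation. This is a simplified model of how sibling routes are walked (first match wins unless it sets `continue`), not Alertmanager's implementation; the `Route` class, `receivers_for`, and the `default` fallback receiver are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Route:
    match: dict              # label name -> required value
    receiver: str
    continue_: bool = False  # mirrors the `continue` flag in the config


def receivers_for(labels: dict, routes: list, default: str = "default") -> list:
    """Walk sibling routes top to bottom; stop at the first match unless it sets continue."""
    matched = []
    for route in routes:
        if all(labels.get(k) == v for k, v in route.match.items()):
            matched.append(route.receiver)
            if not route.continue_:
                break
    return matched or [default]


ROUTES = [
    Route({"severity": "critical"}, "pagerduty", continue_=True),
    Route({"severity": "critical"}, "slack-critical"),
    Route({"severity": "warning"}, "slack-warnings"),
    Route({"team": "db"}, "dba-team"),
]

print(receivers_for({"severity": "critical", "team": "db"}, ROUTES))
print(receivers_for({"team": "db"}, ROUTES))
```

Under this model an alert labeled both `severity: critical` and `team: db` stops at `slack-critical` and never reaches `dba-team`. If DB alerts should always reach the DBA team, the severity routes would also need `continue: true`, or the team route would need to come first.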

Alert fatigue prevention

  • Aggregation: same alert from multiple instances → one notification
  • Inhibition: critical suppresses warning
  • Time-based silences: maintenance windows
  • De-duplication: state-based, not edge-based

SLI/SLO definitions

SLI (Service Level Indicators)

What we measure:

| SLI | Definition |
| --- | --- |
| Availability | % of requests that succeeded |
| Latency | % of requests faster than 1s |
| Throughput | TPS sustained |
| Data integrity | % of replay determinism tests passed |
| Incident resolution | % of incidents resolved within 1 hour |

SLO (Service Level Objectives)

What we commit:

| Phase | SLO |
| --- | --- |
| Phase 0 | 99% availability monthly |
| Phase 0 | 95% requests < 1s |
| Phase 0 | 100% no data loss |
| Phase 1 | 99.9% availability |
| Phase 1 | 99% requests < 1s |
| Phase 2 | 99.95% availability |
| Phase 2 | 99.5% requests < 500ms |

Error budget

SLO 99% = 1% unavailability allowed
1% of 30 days = 7.2 hours/month error budget

If burning: slow down feature releases, focus on reliability
If under: invest in features (some budget OK to spend)

# Error budget remaining (1 = untouched, 0 = exhausted)
1 - (
  (sum(rate(http_server_requests_total{status=~"5.."}[30d])) / sum(rate(http_server_requests_total[30d])))
  / (1 - 0.99)  # allowed error ratio for the 99% SLO
)

Track in Grafana, alert when error budget burning fast.
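
The same arithmetic in Python, as a quick sanity check on the numbers above. The function names are illustrative, and the request counts in the example are made up.

```python
def error_budget_hours(slo: float, window_days: int = 30) -> float:
    """Allowed downtime in hours for an availability SLO over the window."""
    return (1.0 - slo) * window_days * 24


def budget_remaining(slo: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the error budget left (1.0 = untouched, <= 0 = exhausted)."""
    allowed_failures = (1.0 - slo) * total_requests
    return 1.0 - failed_requests / allowed_failures


print(round(error_budget_hours(0.99), 2))
# Hypothetical month: 1M requests, 2,500 failures against a 99% SLO.
print(round(budget_remaining(0.99, 1_000_000, 2_500), 2))
```

A 99% SLO allows 10,000 failures per million requests, so 2,500 failures leave three quarters of the budget. A negative result means the budget is overspent.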

SLO burn rate alerts

# Burning error budget fast
- alert: ErrorBudgetBurnFast
  expr: |
    (
      slo_error_rate_5m > (14.4 * (1 - 0.99))  # 14.4x = 2% of the 30-day budget in 1h
    )
  for: 2m
  severity: critical

- alert: ErrorBudgetBurnMedium
  expr: |
    (
      slo_error_rate_1h > (6 * (1 - 0.99))    # 6x = 5% of the 30-day budget in 6h
    )
  for: 15m
  severity: warning

Thresholds follow the multi-window, multi-burn-rate approach from the Google SRE Workbook.
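
The 14.4 and 6 multipliers come from simple arithmetic on a 30-day window, sketched below. Function names are illustrative.

```python
def burn_rate(error_rate: float, slo: float) -> float:
    """How many times faster than 'exactly on budget' errors are occurring."""
    return error_rate / (1.0 - slo)


def budget_consumed(rate: float, hours: float, window_days: int = 30) -> float:
    """Fraction of the window's error budget consumed by `hours` at this burn rate."""
    return rate * hours / (window_days * 24)


def hours_to_exhaustion(rate: float, window_days: int = 30) -> float:
    """Hours until the whole budget is gone if the burn rate holds."""
    return window_days * 24 / rate


print(f"{budget_consumed(14.4, 1):.0%} of budget in 1h at 14.4x")
print(f"{budget_consumed(6, 6):.0%} of budget in 6h at 6x")
print(f"{hours_to_exhaustion(14.4):.0f}h to exhaust at 14.4x")
```

A burn rate of 1 spends the budget exactly at the end of the window; at 14.4x the 30-day budget is gone in about 50 hours, which is why that tier pages.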

Dashboards design

Dashboard 1: Platform Health (5-min refresh)

For SREs at-a-glance:

  • 4 SLI panels (top row, big numbers)
  • Engine TPS / Latency / Errors / Queue
  • DB connections / Replication lag
  • Resources (CPU, Memory, Disk)

Dashboard 2: Business KPIs

For ops/PM:

  • Process instances created per hour (by process)
  • Completion rate
  • Incident rate (by type)
  • User task creation/completion rates
  • Average process duration
  • Top 10 processes by volume

Dashboard 3: Per-tenant

For customer success / billing:

  • Per-tenant TPS
  • Per-tenant resource usage
  • Per-tenant incident rate
  • Quota usage

Dashboard 4: Worker Performance

For dev teams:

  • Jobs per worker type
  • Worker latency (p50, p99)
  • Failure rate by job type
  • Activation queue depth

Dashboard 5: Postgres

For DBA:

  • Connections active/idle
  • Slow queries (pg_stat_statements)
  • Cache hit ratio
  • Replication lag
  • WAL stats
  • Top tables by size/growth

Cost optimization

Sample logs aggressively

INFO logs at 100K/sec × 1 KB = 100 MB/sec = 8.6 TB/day, which is expensive to store. Sampling at 1% brings that down to 86 GB/day. Always keep ERROR/WARN.
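
The volume figures can be reproduced with a short calculation. The function name is illustrative, and the $0.50/GB price is the low end of the log cost range quoted earlier in this document, used here as an assumption.

```python
def daily_log_volume_gb(events_per_sec: float, bytes_per_event: int,
                        sample_rate: float = 1.0) -> float:
    """Ingested volume per day in GB (decimal, 1 GB = 1e9 bytes)."""
    return events_per_sec * bytes_per_event * sample_rate * 86_400 / 1e9


full = daily_log_volume_gb(100_000, 1_000)
sampled = daily_log_volume_gb(100_000, 1_000, sample_rate=0.01)
monthly_cost = sampled * 30 * 0.50  # assumed $0.50 per GB ingested

print(f"{full / 1000:.2f} TB/day unsampled, {sampled:.1f} GB/day at 1%")
print(f"~${monthly_cost:,.0f}/month at $0.50/GB")
```

Because volume is linear in the sample rate, halving the rate halves the bill, so the sample rate is the main cost lever for INFO logs.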

Don't ingest debug logs to APM

DEBUG locally only. ALERT only on ERRORs.

Trace sampling

100% only for slow/error requests. Otherwise 1% baseline.

Metrics: avoid high-cardinality labels

process_instance_key label → millions of series → metrics DB explodes.

Retention tiered

  • Hot (recent): high-priced fast storage
  • Warm (1 week): cheaper
  • Cold (30+ days): S3 Glacier

Estimated costs

For MVP at ~100 TPS:

| Component | Basis | Cost (monthly) |
| --- | --- | --- |
| Metrics (Prometheus or APM) | 100K series × $0.10 per 100 series | $100 |
| Logs (1% sampled) | 86 GB/day × $0.50/GB × 30 days | ~$1,300 |
| Traces (tail-sampled) | 1M spans/day × $0.001 per 1K spans × 30 days | $30 |
| Total APM tooling | | ~$1,500 |

At higher scale, costs grow roughly linearly.

Runbook automation

Alerts link to runbooks:

- alert: HighRequestLatency
  annotations:
    runbook_url: "https://runbooks.mvp.dev/high-latency"

Runbook example:

# High Request Latency

## Detection
TP99 > 1s for 5 min

## Investigation
1. Check Grafana "Platform Health" dashboard
2. Look at top slow queries: `mvp-cli postgres slow-queries`
3. Check engine queue depth
4. Check DB replication lag

## Common causes + fixes
- DB query degradation → check pg_stat_statements
- Engine queue saturating → check why
- Worker overload → check job activation latency

## Escalation
After 30 min unresolved: page on-call engineer