Observability Deep Dive

Observability spec for the MVP: 50+ business-specific metrics with standardized labels; structured logging with sampling and redaction; distributed traces with span design; alert design covering golden signals, business KPIs, and saturation; SLI/SLO definitions (99% availability, 99% commands < 1s, 99.9% no data loss); error budget calculation; runbook automation; RED and USE methods applied. Costs: $1K-10K/month for APM, depending on scale.

Observability vs monitoring

Monitoring: answers predetermined questions (CPU, memory, request count).
Observability: lets you ask new questions (why is THIS user's request slow?).

MVP needs both. Strategy: OpenTelemetry as instrumentation + APM as backend (ADR-010, ADR-011).

The three pillars

Metrics: numerical aggregates

Best for: dashboards, alerts, trends, capacity planning.
Volume: low (per-second aggregations).
Cost: cheap to store, fast to query.

Logs: structured events

Best for: forensics, debugging specific issues, audit.
Volume: medium to high.
Cost: medium ($0.50-2/GB depending on retention).

Traces: request flow

Best for: debugging latency, understanding request flow.
Volume: high, so traces are sampled.
Cost: medium; sampling keeps it down.

Use all three. Each answers different questions.

Metrics catalog

Per ADR-011 (OpenTelemetry instrumentation), all metrics are emitted via OTel.

Business metrics (workflow-specific)

Metric	Type	Labels	Purpose
`workflow.process_instances.created.total`	counter	tenant, bpmn_process_id	Throughput
`workflow.process_instances.completed.total`	counter	tenant, bpmn_process_id, outcome	Success rate
`workflow.process_instances.canceled.total`	counter	tenant, bpmn_process_id, reason	Cancellation tracking
`workflow.process_instances.duration`	histogram	tenant, bpmn_process_id	End-to-end time
`workflow.process_instances.active`	gauge	tenant, bpmn_process_id	In-flight count
`workflow.process_instances.event_count`	histogram	tenant	Detect runaway processes
`workflow.jobs.activated.total`	counter	tenant, job_type	Worker activity
`workflow.jobs.completed.total`	counter	tenant, job_type	Worker success
`workflow.jobs.failed.total`	counter	tenant, job_type, error_type	Worker errors
`workflow.jobs.duration`	histogram	tenant, job_type	Worker latency
`workflow.jobs.activation.queue_depth`	gauge	job_type	Job backlog
`workflow.jobs.timeout.total`	counter	tenant, job_type	Worker hangs
`workflow.incidents.created.total`	counter	tenant, error_type, bpmn_process_id	Incident rate
`workflow.incidents.resolved.total`	counter	tenant, error_type	Resolution rate
`workflow.incidents.active`	gauge	tenant	Incident backlog
`workflow.user_tasks.created.total`	counter	tenant, bpmn_process_id	Task creation
`workflow.user_tasks.completed.total`	counter	tenant, bpmn_process_id, assignee_pattern	Task completion
`workflow.user_tasks.duration`	histogram	tenant	How long tasks pending
`workflow.user_tasks.active`	gauge	tenant	Pending tasks
`workflow.timers.scheduled.total`	counter	tenant	Timer activity
`workflow.timers.fired.total`	counter	tenant	Timer fires
`workflow.messages.published.total`	counter	tenant, message_name	Message activity
`workflow.messages.correlated.total`	counter	tenant, message_name, outcome	Correlation success
`workflow.deployments.created.total`	counter	tenant	Deploy frequency

Engine internal metrics

Metric	Type	Labels	Purpose
`engine.commands.processed.total`	counter	intent, tenant	Engine throughput
`engine.commands.processing.duration`	histogram	intent	Engine latency
`engine.commands.queue.depth`	gauge	-	Queue saturation
`engine.commands.queue.depth.percent`	gauge	-	Saturation 0-100%
`engine.commands.rejected.total`	counter	reason	Backpressure
`engine.replay.position`	gauge	-	Catch-up tracking
`engine.cache.process_definitions.hit_rate`	gauge	-	Cache effectiveness
`engine.cache.size`	gauge	cache_type	Memory tracking

System metrics (standard)

Metric	Type	Source
`process.cpu.utilization`	gauge	Process
`process.memory.usage`	gauge	Process
`process.threads.count`	gauge	Process
`go.gc.duration` (or JVM equiv)	histogram	Runtime
`go.goroutines.count`	gauge	Runtime
`http.server.duration`	histogram	HTTP server
`http.server.requests.total`	counter	HTTP server
`db.connections.active`	gauge	DB pool
`db.connections.idle`	gauge	DB pool
`db.connections.waiting`	gauge	DB pool
`db.query.duration`	histogram	DB queries

PostgreSQL metrics (via postgres_exporter)

Metric	Purpose
`pg_stat_database_*`	DB-level stats
`pg_replication_lag_bytes`	Replication health
`pg_stat_statements_*`	Top slow queries
`pg_stat_user_tables_*`	Table activity
`pg_stat_bgwriter_*`	Background writer
`pg_locks_*`	Lock contention
`pg_settings_*`	Configuration

Cardinality management

Every label multiplies the number of stored series, so a combinatorial explosion of label values becomes a metric storage explosion. Rules:

High cardinality (AVOID in labels)

process_instance_key — millions of values
job_key — millions
user_id — thousands+
business_id — user-controlled

These belong in logs/traces, NOT metrics.

Acceptable cardinality (OK in labels)

tenant_id — typically < 10,000
bpmn_process_id — typically < 100 per tenant
job_type — typically < 100 per tenant
error_type — < 20

Calculate before adding a label:

distinct_label_values × other_label_values × number_of_metrics
should be < ~1M unique series
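The rule of thumb above can be turned into a quick pre-merge check. The following is a minimal sketch, not part of the spec: the function names and the example label counts are hypothetical, and the product is a worst case that assumes every label combination actually occurs.

```python
from math import prod

SERIES_BUDGET = 1_000_000  # the ~1M unique-series ceiling from the rule above


def estimate_series(label_cardinalities: dict, metric_count: int = 1) -> int:
    """Worst-case series count: product of distinct label values times metrics."""
    return prod(label_cardinalities.values()) * metric_count


def within_budget(label_cardinalities: dict, metric_count: int = 1,
                  budget: int = SERIES_BUDGET) -> bool:
    """True when the worst-case series count stays under the budget."""
    return estimate_series(label_cardinalities, metric_count) < budget


# Hypothetical deployment: 50 tenants, 20 process definitions, 3 outcomes.
ok_labels = {"tenant": 50, "bpmn_process_id": 20, "outcome": 3}
# The same labels plus process_instance_key, which belongs in logs/traces.
bad_labels = {**ok_labels, "process_instance_key": 1_000_000}

print(estimate_series(ok_labels))   # 3000
print(within_budget(bad_labels))    # False
```

Histograms store one series per bucket, so their real footprint is the estimate multiplied by the bucket count.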

Logs strategy

Structured logging mandatory

No printf-style messages. Always emit JSON:

{
  "timestamp": "2025-05-14T12:34:56.789Z",
  "level": "INFO",
  "logger": "engine.processor",
  "trace_id": "abc123def456...",
  "span_id": "78901234abcd...",
  "tenant_id": "acme",
  "process_instance_key": 12345,
  "intent": "ELEMENT_COMPLETED",
  "element_id": "task1",
  "duration_ms": 23,
  "message": "Element completed successfully"
}

Required fields:
- timestamp (ISO 8601 UTC)
- level (DEBUG, INFO, WARN, ERROR, FATAL)
- logger (component name)
- message (human-readable)
- trace_id (correlation with the distributed trace)

Optional but recommended:
- span_id
- tenant_id
- Resource IDs (when applicable)
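One way to produce this shape with only the Python standard library is a custom formatter. This is a sketch under assumptions: the `JsonFormatter` class, the `format_example` helper, and the convention of passing optional fields through a `fields` attribute are illustrative, not something the spec prescribes.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object carrying the required fields."""

    def format(self, record: logging.LogRecord) -> str:
        ts = datetime.fromtimestamp(record.created, tz=timezone.utc)
        entry = {
            "timestamp": ts.isoformat(timespec="milliseconds").replace("+00:00", "Z"),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # trace_id is required; None makes a missing correlation visible.
            "trace_id": getattr(record, "trace_id", None),
        }
        # Optional fields arrive on record.fields (a convention of this sketch).
        entry.update(getattr(record, "fields", {}))
        return json.dumps(entry)


def format_example() -> dict:
    """Build one record by hand and return the parsed JSON output."""
    record = logging.LogRecord(
        name="engine.processor", level=logging.INFO, pathname="", lineno=0,
        msg="Element completed successfully", args=(), exc_info=None,
    )
    record.trace_id = "abc123def456"
    record.fields = {"tenant_id": "acme", "element_id": "task1", "duration_ms": 23}
    return json.loads(JsonFormatter().format(record))


print(format_example())
```

Python's standard level names are WARNING and CRITICAL, so a mapping step would be needed to emit WARN and FATAL exactly as listed.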

Log levels

Level	Use for	Volume
DEBUG	Detailed flow, dev-only	Off in prod
INFO	Routine operations	Sampled in prod (1%)
WARN	Recoverable issues	All in prod
ERROR	Failures, incidents	All in prod
FATAL	About to die	All in prod

Sampling strategy

For high-volume INFO logs, sample:

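import logging
import random

logger = logging.getLogger("engine.processor")

def log_info_sampled(msg, sample_rate=0.01, important=False, **fields):
    """Log at INFO with probability sample_rate; important events always pass."""
    if important or random.random() < sample_rate:
        # Structured fields travel in `extra` so a JSON formatter can emit them.
        logger.info(msg, extra={"fields": fields})

# For critical paths: always log
log_info_sampled("Process completed", important=True, process_instance_key=12345)

# For routine: sample
log_info_sampled("Variable read", sample_rate=0.001, element_id="task1")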

Always-log: ERROR, WARN, security events, audit-relevant.

Redaction

Per security threat model T4.3:

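import re

# Keys whose values must never reach the log pipeline (case-insensitive match).
SENSITIVE_KEYS = re.compile(r'password|token|secret|api_?key|ssn|credit', re.I)

def redact(obj):
    """Recursively replace values of sensitive keys in dicts and lists."""
    if isinstance(obj, dict):
        return {k: '[REDACTED]' if SENSITIVE_KEYS.search(k) else redact(v)
                for k, v in obj.items()}
    if isinstance(obj, list):
        return [redact(v) for v in obj]
    return obj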

Applied automatically before logging.
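One way to make the redaction automatic is a `logging.Filter` attached to the logger, so every record is scrubbed before any handler sees it. The sketch below assumes structured fields travel on a `fields` attribute of the record; `RedactionFilter`, `ListHandler`, and `demo` are illustrative names, not part of the spec.

```python
import logging
import re

SENSITIVE_KEYS = re.compile(r"password|token|secret|api_?key|ssn|credit", re.I)


def redact(obj):
    """Recursively replace values of sensitive keys in dicts and lists."""
    if isinstance(obj, dict):
        return {k: "[REDACTED]" if SENSITIVE_KEYS.search(k) else redact(v)
                for k, v in obj.items()}
    if isinstance(obj, list):
        return [redact(v) for v in obj]
    return obj


class RedactionFilter(logging.Filter):
    """Scrub record.fields in place; never drops the record itself."""

    def filter(self, record: logging.LogRecord) -> bool:
        fields = getattr(record, "fields", None)
        if fields is not None:
            record.fields = redact(fields)
        return True


class ListHandler(logging.Handler):
    """Test handler that keeps emitted records in memory."""

    def __init__(self):
        super().__init__()
        self.records = []

    def emit(self, record):
        self.records.append(record)


def demo() -> dict:
    logger = logging.getLogger("redaction.demo")
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = ListHandler()
    logger.addFilter(RedactionFilter())
    logger.addHandler(handler)
    logger.info("login", extra={"fields": {
        "user": "ana", "api_key": "k-123", "nested": [{"Password": "x"}]}})
    return handler.records[0].fields


print(demo())
```

Matching is on key names only, so a secret embedded in a free-text message would still pass through.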

Log retention

Log type	Hot retention	Cold retention
ERROR/FATAL	30 days	1 year
WARN	14 days	90 days
INFO sampled	7 days	30 days
DEBUG	Off in prod	-
Audit log	30 days	7 years (compliance)

Storage backends:
- Hot: Loki/Elasticsearch/Splunk
- Cold: S3 with Athena queries

Distributed tracing

Span design

Create a span for each operation:

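from opentelemetry import trace

tracer = trace.get_tracer("workflow.engine")

def process_command(tenant_id, intent, bpmn_process_id):
    # The span covers the whole command; attributes make it searchable in the APM.
    with tracer.start_as_current_span("process_command") as span:
        span.set_attribute("workflow.tenant.id", tenant_id)
        span.set_attribute("workflow.command.intent", intent)
        span.set_attribute("workflow.process.id", bpmn_process_id)

        # Process...

        span.set_attribute("workflow.command.success", True)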

Span hierarchy

HTTP request (parent span)
├── auth.validate_token
├── api.handler.create_instance
│   ├── db.command_log.insert
│   └── notify.engine
│       └── engine.process_command (could be separate trace if async)
│           ├── db.read_state
│           ├── bpmn.parse (cached usually)
│           ├── bpmn.process
│           │   ├── behavior.variable_mapping
│           │   ├── behavior.job_creation
│           │   └── ...
│           └── db.write_state_atomic

Workers continue the trace:

worker.activate_jobs
├── http.request to engine
└── http.response (with trace context propagated back)

worker.handler.send_email
├── (continuation of process_command trace via propagated trace context)
├── http.request to email provider
└── worker.complete_job
    └── http.request to engine

End-to-end visibility from user request → worker → completion.

Sampling

Sampling 100% of traces is expensive. Strategies:

import random

PREMIUM_TENANTS = {"acme"}  # placeholder; load from configuration

# Tail-based sampling: decide after the trace completes, keeping every slow or failed request
def should_sample(trace):
    return (
        trace.duration_ms > 1000 or          # Slow
        trace.has_error or                    # Failed
        trace.tenant in PREMIUM_TENANTS or    # VIP
        random.random() < 0.01                # 1% baseline
    )

Most APMs handle this server-side.

Trace context propagation

W3C Trace Context standard:

GET /v2/jobs/12345 HTTP/1.1
traceparent: 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01
tracestate: vendor1=...

The OTel SDK handles this automatically.
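For debugging propagation problems it helps to know what the header contains. The following is an illustrative stdlib-only sketch of parsing a version-00 `traceparent` value and deriving the header for an outgoing call; the `TraceParent` type and both helper functions are hypothetical names, and in practice the OTel propagators do this work.

```python
import re
import secrets
from typing import NamedTuple, Optional

# version-traceid-parentid-flags, all lowercase hex (W3C Trace Context, version 00)
_TRACEPARENT = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


class TraceParent(NamedTuple):
    version: str
    trace_id: str
    parent_id: str
    flags: str

    @property
    def sampled(self) -> bool:
        """The lowest bit of the flags byte is the 'sampled' flag."""
        return (int(self.flags, 16) & 0x01) == 1


def parse_traceparent(header: str) -> Optional[TraceParent]:
    """Return the parsed header, or None when it is malformed."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        return None
    tp = TraceParent(*match.groups())
    # Version ff and all-zero IDs are invalid.
    if tp.version == "ff" or set(tp.trace_id) == {"0"} or set(tp.parent_id) == {"0"}:
        return None
    return tp


def child_traceparent(parent: TraceParent) -> str:
    """Header for an outgoing call: same trace_id, freshly generated span ID."""
    return f"{parent.version}-{parent.trace_id}-{secrets.token_hex(8)}-{parent.flags}"


incoming = parse_traceparent("00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01")
print(incoming.trace_id, incoming.sampled)
print(child_traceparent(incoming))
```

The trace ID stays constant across the engine and its workers, which is what ties their spans into one trace.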

Span attribute conventions

Per OTel semantic conventions:

HTTP: http.method, http.url, http.status_code (renamed to http.request.method, url.full, http.response.status_code in the stable HTTP conventions)
DB: db.system, db.statement, db.user
Workflow custom: workflow.tenant.id, workflow.process.id, etc.

Document workflow.* conventions explicitly for consistency.
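One lightweight way to keep the workflow.* namespace consistent is a single registry that code review or tests can check against. This is a sketch: the four attribute names come from the span example earlier in this section, while the registry and the `validate_attributes` helper are hypothetical.

```python
# Registry of custom span attributes and their expected value types.
WORKFLOW_ATTRIBUTES = {
    "workflow.tenant.id": str,
    "workflow.process.id": str,
    "workflow.command.intent": str,
    "workflow.command.success": bool,
}


def validate_attributes(attrs: dict) -> list:
    """Return a list of problems found in workflow.* attributes (empty if none)."""
    problems = []
    for key, value in attrs.items():
        if not key.startswith("workflow."):
            continue  # http.*, db.* and others follow the OTel conventions
        expected = WORKFLOW_ATTRIBUTES.get(key)
        if expected is None:
            problems.append(f"unknown attribute: {key}")
        elif not isinstance(value, expected):
            problems.append(
                f"{key}: expected {expected.__name__}, got {type(value).__name__}")
    return problems


print(validate_attributes({"workflow.tenant.id": "acme", "http.method": "GET"}))
print(validate_attributes({"workflow.tenant_id": "acme"}))
```

Running a check like this in unit tests catches drift such as `workflow.tenant_id` versus `workflow.tenant.id` before it reaches the APM.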

Alerting design

Golden signals (SRE book)

Always alert on these:

Latency: TP99 > threshold for N min
Traffic: TPS unusual (anomaly)
Errors: Error rate > threshold
Saturation: Resource utilization high

For MVP engine:

# Latency (assumes histogram buckets are in seconds)
- alert: HighRequestLatency
  expr: histogram_quantile(0.99, sum by (le) (rate(http_server_duration_bucket[5m]))) > 1.0
  for: 5m
  labels:
    severity: warning

# Traffic anomaly
- alert: TrafficDropSudden
  expr: sum(rate(http_server_requests_total[5m])) < sum(rate(http_server_requests_total[1h] offset 1h)) * 0.5
  for: 10m
  labels:
    severity: warning

# Error rate (share of 5xx responses)
- alert: HighErrorRate
  expr: sum(rate(http_server_requests_total{status=~"5.."}[5m])) / sum(rate(http_server_requests_total[5m])) > 0.05
  for: 5m
  labels:
    severity: critical

# Saturation
- alert: EngineQueueSaturating
  expr: engine_commands_queue_depth_percent > 80
  for: 2m
  labels:
    severity: warning

- alert: EngineQueueCritical
  expr: engine_commands_queue_depth_percent > 95
  for: 1m
  labels:
    severity: critical

Business KPI alerts

# Platform-wide rates; drop sum() to alert per tenant/process series instead
- alert: ProcessInstanceCreationDropped
  expr: sum(rate(workflow_process_instances_created_total[15m])) < 1
  for: 30m
  labels:
    severity: warning
  annotations:
    summary: "Process creation rate unusual"

- alert: IncidentRateHigh
  expr: sum(rate(workflow_incidents_created_total[5m])) > 1
  for: 5m
  labels:
    severity: warning

Failure-mode alerts

Per failure mode analysis, specific alerts:

# F1: Engine crash
- alert: EngineDown
  expr: up{job="workflow-engine"} == 0
  for: 1m
  labels:
    severity: critical

# F2: Processing stuck (summed so an idle intent or tenant alone does not fire)
- alert: EngineProcessingStuck
  expr: sum(rate(engine_commands_processed_total[5m])) == 0
  for: 5m
  labels:
    severity: critical

# F5: DB primary failed
- alert: PostgresPrimaryDown
  expr: pg_up{role="primary"} == 0
  for: 30s
  labels:
    severity: critical

# F6: Replication lag
- alert: PostgresReplicationLag
  expr: pg_replication_lag_bytes > 100e6
  for: 5m
  labels:
    severity: warning

# F7: Disk full
- alert: DiskFullCritical
  expr: (node_filesystem_avail_bytes / node_filesystem_size_bytes) < 0.1
  for: 5m
  labels:
    severity: critical

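Alert routing

# Routes (evaluation stops at the first match unless continue: true)
routes:
  - match: { severity: critical }
    receiver: pagerduty
    continue: true
  - match: { severity: critical }
    receiver: slack-critical
    continue: true   # keep going so the team route below can still match
  - match: { severity: warning }
    receiver: slack-warnings
    continue: true   # keep going so the team route below can still match
  - match: { team: db }
    receiver: dba-team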

Alert fatigue prevention

Aggregation: same alert from multiple instances → one notification
Inhibition: critical suppresses warning
Time-based silences: maintenance windows
De-duplication: state-based, not edge-based
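The first two rules can be illustrated with a small sketch. It is a simplification of what an alert manager does, under stated assumptions: alerts are plain dicts, the `service` and `instance` labels are hypothetical, and inhibition is scoped to alerts that share the same `service`.

```python
from collections import defaultdict

SAMPLE_ALERTS = [
    {"alertname": "EngineQueueCritical", "severity": "critical",
     "service": "engine", "instance": "engine-0"},
    {"alertname": "EngineQueueCritical", "severity": "critical",
     "service": "engine", "instance": "engine-1"},
    {"alertname": "EngineQueueSaturating", "severity": "warning",
     "service": "engine", "instance": "engine-0"},
    {"alertname": "PostgresReplicationLag", "severity": "warning",
     "service": "db", "instance": "pg-1"},
]


def plan_notifications(alerts: list) -> list:
    """Group firing alerts into notifications, applying inhibition first."""
    # Inhibition: a critical alert suppresses warnings for the same service.
    critical_services = {a["service"] for a in alerts if a["severity"] == "critical"}

    grouped = defaultdict(list)
    for alert in alerts:
        if alert["severity"] == "warning" and alert["service"] in critical_services:
            continue
        # Aggregation: one notification per (alert name, severity), all instances listed.
        grouped[(alert["alertname"], alert["severity"])].append(alert["instance"])

    return [
        {"alertname": name, "severity": severity, "instances": sorted(instances)}
        for (name, severity), instances in sorted(grouped.items())
    ]


for notification in plan_notifications(SAMPLE_ALERTS):
    print(notification)
```

Four firing alerts collapse into two notifications here: one critical covering both engine instances, and one database warning.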

SLI/SLO definitions

SLI (Service Level Indicators)

What we measure:

SLI	Definition
Availability	% of requests that succeeded
Latency	% of requests faster than 1s
Throughput	TPS sustained
Data integrity	% of replay determinism tests passed
Incident resolution	% of incidents resolved within 1 hour

SLO (Service Level Objectives)

What we commit:

Phase	SLO
Phase 0	99% availability monthly
Phase 0	95% requests < 1s
Phase 0	100% no data loss
Phase 1	99.9% availability
Phase 1	99% requests < 1s
Phase 2	99.95% availability
Phase 2	99.5% requests < 500ms

Error budget

SLO 99% = 1% unavailability allowed
1% of 30 days = 7.2 hours/month error budget

If burning: slow down feature releases, focus on reliability
If under: invest in features (some budget OK to spend)
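The same arithmetic applies to every availability target. The sketch below assumes a 30-day period and a request-based availability SLI; the function names are illustrative.

```python
def error_budget_hours(slo: float, period_days: int = 30) -> float:
    """Allowed downtime per period for an availability SLO such as 0.99."""
    return (1 - slo) * period_days * 24


def budget_remaining(slo: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the budget left: 1.0 = untouched, 0.0 = spent, negative = overspent."""
    allowed_failures = (1 - slo) * total_requests
    if allowed_failures == 0:
        raise ValueError("a 100% SLO (or zero traffic) leaves no error budget")
    return 1 - failed_requests / allowed_failures


for target in (0.99, 0.999, 0.9995):
    print(f"{target:.2%} -> {error_budget_hours(target):.2f} hours/month")

# 2,500 failures out of 1M requests against a 99% SLO uses a quarter of the budget.
print(round(budget_remaining(0.99, 1_000_000, 2_500), 4))
```

Moving from 99% to 99.9% shrinks the monthly budget from 7.2 hours to about 43 minutes, which is why each later phase needs faster detection and recovery.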

# Error budget remaining (1 = untouched, 0 = exhausted, negative = overspent)
1 - (
  (sum(rate(http_server_requests_total{status=~"5.."}[30d])) / sum(rate(http_server_requests_total[30d])))
  / (1 - 0.99)  # allowed error ratio for a 99% SLO
)

Track in Grafana, alert when error budget burning fast.

SLO burn rate alerts

# Burning error budget fast
- alert: ErrorBudgetBurnFast
  expr: |
    (
      slo_error_rate_5m > (14.4 * (1 - 0.99))  # 14.4x sustained for 1h = 2% of the 30-day budget
    )
  for: 2m
  labels:
    severity: critical

- alert: ErrorBudgetBurnMedium
  expr: |
    (
      slo_error_rate_1h > (6 * (1 - 0.99))    # 6x sustained for 6h = 5% of the 30-day budget
    )
  for: 15m
  labels:
    severity: warning

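Multi-window, multi-burn-rate approach (Google SRE Workbook, "Alerting on SLOs"). In the full form, each burn rate is checked over both a long and a short window (for example 1h and 5m for 14.4x) so the alert stops firing soon after the burn stops.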
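The burn-rate multipliers follow from simple arithmetic, sketched below for a 30-day budget period. The function names are illustrative.

```python
def burn_rate(error_rate: float, slo: float) -> float:
    """How many times faster than 'exactly on budget' errors are occurring."""
    return error_rate / (1 - slo)


def budget_consumed(rate: float, window_hours: float, period_days: int = 30) -> float:
    """Fraction of the period's error budget spent if `rate` holds for the window."""
    return rate * window_hours / (period_days * 24)


def hours_to_exhaustion(rate: float, period_days: int = 30) -> float:
    """Time until the whole budget is gone at a constant burn rate."""
    return period_days * 24 / rate


print(round(budget_consumed(14.4, 1), 4))    # fast burn over 1 hour
print(round(budget_consumed(6, 6), 4))       # medium burn over 6 hours
print(round(hours_to_exhaustion(14.4), 1))   # hours until the budget is spent
```

A 14.4x burn spends 2% of the monthly budget per hour and would exhaust it in 50 hours, while a 6x burn spends 5% over six hours.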

Dashboards design

Dashboard 1: Platform Health (5-min refresh)

For SREs at-a-glance:
- 4 SLI panels (top row, big numbers)
- Engine TPS / Latency / Errors / Queue
- DB connections / Replication lag
- Resources (CPU, Memory, Disk)

Dashboard 2: Business KPIs

For ops/PM:
- Process instances created per hour (by process)
- Completion rate
- Incident rate (by type)
- User task creation/completion rates
- Average process duration
- Top 10 processes by volume

Dashboard 3: Per-tenant

For customer success / billing:
- Per-tenant TPS
- Per-tenant resource usage
- Per-tenant incident rate
- Quota usage

Dashboard 4: Worker Performance

For dev teams:
- Jobs per worker type
- Worker latency (p50, p99)
- Failure rate by job type
- Activation queue depth

Dashboard 5: Postgres

For DBA:
- Connections active/idle
- Slow queries (pg_stat_statements)
- Cache hit ratio
- Replication lag
- WAL stats
- Top tables by size/growth

Cost optimization

Sample logs aggressively

INFO logs at 100K/sec × 1 KB = 100 MB/sec ≈ 8.6 TB/day, which is expensive to store. Sampling at 1% brings that down to about 86 GB/day. Always keep ERROR and WARN.
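The volume estimate is easy to rerun for other event rates or sample rates. This sketch uses decimal units (1 KB = 1,000 bytes, 1 GB = 10^9 bytes) and the $0.50/GB lower-bound price quoted for logs in this document; the function names are illustrative.

```python
SECONDS_PER_DAY = 86_400


def daily_log_volume_gb(events_per_sec: float, bytes_per_event: float,
                        sample_rate: float = 1.0) -> float:
    """Log volume per day in GB after sampling."""
    return events_per_sec * bytes_per_event * SECONDS_PER_DAY * sample_rate / 1e9


def monthly_cost_usd(gb_per_day: float, usd_per_gb: float, days: int = 30) -> float:
    """Ingest/storage cost per month at a flat per-GB price."""
    return gb_per_day * usd_per_gb * days


full = daily_log_volume_gb(100_000, 1_000)            # unsampled
sampled = daily_log_volume_gb(100_000, 1_000, 0.01)   # 1% sampling

print(f"unsampled: {full:,.0f} GB/day")
print(f"sampled:   {sampled:,.1f} GB/day")
print(f"monthly:   ${monthly_cost_usd(sampled, 0.50):,.0f}")
```

At 1% sampling the result is 86.4 GB/day and roughly $1,300 per month, in line with the logs row of the cost estimate.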

Don't ingest debug logs to APM

Keep DEBUG logs local only. Alert only on ERROR-level events.

Trace sampling

100% only for slow/error requests. Otherwise 1% baseline.

Metrics: avoid high-cardinality labels

process_instance_key label → millions of series → metrics DB explodes.

Retention tiered

Hot (recent): high-priced fast storage
Warm (1 week): cheaper
Cold (30+ days): S3 Glacier

Estimated costs

For MVP at ~100 TPS:

Component	Basis	Cost (monthly)
Metrics (Prometheus or APM)	100K series × $0.10 per 100 series	$100
Logs (1% sampled)	86 GB/day × $0.50/GB × 30 days	~$1,300
Traces (tail-sampled)	1M spans/day × $0.001 per 1K spans × 30 days	$30
Total APM tooling		~$1,500

At higher scale, costs grow roughly linearly.

Runbook automation

Alerts link to runbooks:

- alert: HighRequestLatency
  annotations:
    runbook_url: "https://runbooks.mvp.dev/high-latency"

Runbook example:

# High Request Latency

## Detection
TP99 > 1s for 5 min

## Investigation
1. Check Grafana "Platform Health" dashboard
2. Look at top slow queries: `mvp-cli postgres slow-queries`
3. Check engine queue depth
4. Check DB replication lag

## Common causes + fixes
- DB query degradation → check pg_stat_statements
- Engine queue saturating → check why
- Worker overload → check job activation latency

## Escalation
After 30 min unresolved: page on-call engineer