Observability Deep Dive
Observability spec for the MVP: 50+ business-specific metrics with standardized labels, structured logging with sampling and redaction, distributed traces with span design, and alert design covering golden signals, business KPIs, and saturation. SLI/SLO definitions: 99% availability, 99% commands < 1s, 99.9% no data loss. Also covers error budget calculation, runbook automation, and the RED and USE methods applied. Costs: $1K-10K/month for APM, depending on scale.
Observability vs monitoring¶
- Monitoring: predetermined questions (CPU, memory, request count)
- Observability: ask new questions (why is THIS user's request slow?)
MVP needs both. Strategy: OpenTelemetry as instrumentation + APM as backend (ADR-010, ADR-011).
The three pillars¶
Metrics — numerical aggregates¶
- Best for: dashboards, alerts, trends, capacity planning.
- Volume: low (per-second aggregations).
- Cost: cheap to store, fast to query.
Logs — structured events¶
- Best for: forensics, debugging specific issues, audit.
- Volume: medium to high.
- Cost: medium ($0.50-2/GB depending on retention).
Traces — request flow¶
- Best for: debugging latency, understanding flow.
- Volume: high (sampled).
- Cost: medium (sampling reduces it).
Use all three. Each answers different questions.
Metrics catalog¶
Per adrs/adr-011-opentelemetry-instrumentation, all metrics are emitted via OTel.
Business metrics (workflow-specific)¶
| Metric | Type | Labels | Purpose |
|---|---|---|---|
| workflow.process_instances.created.total | counter | tenant, bpmn_process_id | Throughput |
| workflow.process_instances.completed.total | counter | tenant, bpmn_process_id, outcome | Success rate |
| workflow.process_instances.canceled.total | counter | tenant, bpmn_process_id, reason | Cancellation tracking |
| workflow.process_instances.duration | histogram | tenant, bpmn_process_id | End-to-end time |
| workflow.process_instances.active | gauge | tenant, bpmn_process_id | In-flight count |
| workflow.process_instances.event_count | histogram | tenant | Detect runaway processes |
| workflow.jobs.activated.total | counter | tenant, job_type | Worker activity |
| workflow.jobs.completed.total | counter | tenant, job_type | Worker success |
| workflow.jobs.failed.total | counter | tenant, job_type, error_type | Worker errors |
| workflow.jobs.duration | histogram | tenant, job_type | Worker latency |
| workflow.jobs.activation.queue_depth | gauge | job_type | Job backlog |
| workflow.jobs.timeout.total | counter | tenant, job_type | Worker hangs |
| workflow.incidents.created.total | counter | tenant, error_type, bpmn_process_id | Incident rate |
| workflow.incidents.resolved.total | counter | tenant, error_type | Resolution rate |
| workflow.incidents.active | gauge | tenant | Incident backlog |
| workflow.user_tasks.created.total | counter | tenant, bpmn_process_id | Task creation |
| workflow.user_tasks.completed.total | counter | tenant, bpmn_process_id, assignee_pattern | Task completion |
| workflow.user_tasks.duration | histogram | tenant | How long tasks stay pending |
| workflow.user_tasks.active | gauge | tenant | Pending tasks |
| workflow.timers.scheduled.total | counter | tenant | Timer activity |
| workflow.timers.fired.total | counter | tenant | Timer fires |
| workflow.messages.published.total | counter | tenant, message_name | Message activity |
| workflow.messages.correlated.total | counter | tenant, message_name, outcome | Correlation success |
| workflow.deployments.created.total | counter | tenant | Deploy frequency |
Engine internal metrics¶
| Metric | Type | Labels | Purpose |
|---|---|---|---|
| engine.commands.processed.total | counter | intent, tenant | Engine throughput |
| engine.commands.processing.duration | histogram | intent | Engine latency |
| engine.commands.queue.depth | gauge | - | Queue saturation |
| engine.commands.queue.depth.percent | gauge | - | Saturation 0-100% |
| engine.commands.rejected.total | counter | reason | Backpressure |
| engine.replay.position | gauge | - | Catch-up tracking |
| engine.cache.process_definitions.hit_rate | gauge | - | Cache effectiveness |
| engine.cache.size | gauge | cache_type | Memory tracking |
System metrics (standard)¶
| Metric | Type | Source |
|---|---|---|
| process.cpu.utilization | gauge | Process |
| process.memory.usage | gauge | Process |
| process.threads.count | gauge | Process |
| go.gc.duration (or JVM equiv) | histogram | Runtime |
| go.goroutines.count | gauge | Runtime |
| http.server.duration | histogram | HTTP server |
| http.server.requests.total | counter | HTTP server |
| db.connections.active | gauge | DB pool |
| db.connections.idle | gauge | DB pool |
| db.connections.waiting | gauge | DB pool |
| db.query.duration | histogram | DB queries |
PostgreSQL metrics (via postgres_exporter)¶
| Metric | Purpose |
|---|---|
| pg_stat_database_* | DB-level stats |
| pg_replication_lag_bytes | Replication health |
| pg_stat_statements_* | Top slow queries |
| pg_stat_user_tables_* | Table activity |
| pg_stat_bgwriter_* | Background writer |
| pg_locks_* | Lock contention |
| pg_settings_* | Configuration |
Cardinality management¶
Every label multiplies the number of time series, so a combinatorial explosion of label values becomes a metric storage explosion. Rules:
High cardinality (AVOID in labels)¶
- process_instance_key: millions of values
- job_key: millions
- user_id: thousands+
- business_id: user-controlled
These belong in logs/traces, NOT metrics.
Acceptable cardinality (OK in labels)¶
- tenant_id: typically < 10,000
- bpmn_process_id: typically < 100 per tenant
- job_type: typically < 100 per tenant
- error_type: < 20
Calculate before adding label:
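One way to run that calculation is to multiply the expected number of distinct values of each label, which gives a worst-case series count for the metric. This is a minimal sketch: the function name `estimate_series` is illustrative, and the figures are simply the upper bounds listed above.

```python
from math import prod

def estimate_series(label_cardinalities: dict[str, int]) -> int:
    """Worst-case time series for one metric: product of distinct values per label."""
    return prod(label_cardinalities.values())

# Upper bounds from the lists above, applied to workflow.jobs.failed.total
worst_case = estimate_series({"tenant": 10_000, "job_type": 100, "error_type": 20})
print(f"{worst_case:,} series worst case")  # 20,000,000 series worst case
```

The worst case is rarely reached, since not every tenant uses every job type, but it is a useful upper bound to compare against the series limit of the metrics backend.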
Logs strategy¶
Structured logging mandatory¶
NO printf style. JSON always:
{
  "timestamp": "2025-05-14T12:34:56.789Z",
  "level": "INFO",
  "logger": "engine.processor",
  "trace_id": "abc123def456...",
  "span_id": "78901234abcd...",
  "tenant_id": "acme",
  "process_instance_key": 12345,
  "intent": "ELEMENT_COMPLETED",
  "element_id": "task1",
  "duration_ms": 23,
  "message": "Element completed successfully"
}
Required fields:
- timestamp (ISO 8601 UTC)
- level (DEBUG, INFO, WARN, ERROR, FATAL)
- logger (component name)
- message (human-readable)
- trace_id (correlation with the distributed trace)
Optional but recommended:
- span_id
- tenant_id
- Resource IDs (when applicable)
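The field list above could be produced with Python's standard `logging` module along these lines. This is a sketch under assumptions, not the project's actual logger: the class name `JsonFormatter` and the convention of passing optional fields under `extra={"fields": ...}` are illustrative choices.

```python
import json
import logging
from datetime import datetime, timezone

# Python names these levels WARNING and CRITICAL; map them to the names used here
LEVEL_NAMES = {"WARNING": "WARN", "CRITICAL": "FATAL"}

class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object containing the required fields."""

    def format(self, record: logging.LogRecord) -> str:
        ts = datetime.fromtimestamp(record.created, tz=timezone.utc)
        entry = {
            "timestamp": ts.isoformat(timespec="milliseconds").replace("+00:00", "Z"),
            "level": LEVEL_NAMES.get(record.levelname, record.levelname),
            "logger": record.name,
            "message": record.getMessage(),
            # trace_id arrives via `extra`; None when no trace is active
            "trace_id": getattr(record, "trace_id", None),
        }
        # Optional fields (span_id, tenant_id, resource IDs) passed as extra["fields"]
        entry.update(getattr(record, "fields", {}))
        return json.dumps(entry)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("engine.processor")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info(
    "Element completed successfully",
    extra={"trace_id": "abc123", "fields": {"tenant_id": "acme", "duration_ms": 23}},
)
```

Keeping the required fields in the formatter rather than at each call site means a log line cannot silently omit one of them.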
Log levels¶
| Level | Use for | Volume |
|---|---|---|
| DEBUG | Detailed flow, dev-only | Off in prod |
| INFO | Routine operations | Sampled in prod (1%) |
| WARN | Recoverable issues | All in prod |
| ERROR | Failures, incidents | All in prod |
| FATAL | About to die | All in prod |
Sampling strategy¶
For high-volume INFO logs, sample:
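import logging
import random

logger = logging.getLogger("engine.processor")

def log_info_sampled(msg, sample_rate=0.01, important=False, **fields):
    # Emit the entry for a random fraction of calls; important entries always pass.
    # Structured fields go through `extra`, as stdlib logging does not accept arbitrary kwargs.
    if important or random.random() < sample_rate:
        logger.info(msg, extra={"fields": fields})

# For critical paths: always log
log_info_sampled("Process completed", important=True, process_instance_key=12345)
# For routine events: sample
log_info_sampled("Variable read", sample_rate=0.001, element_id="task1")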
Always-log: ERROR, WARN, security events, audit-relevant.
Redaction¶
Per analysis/security-threat-model T4.3:
import re

SENSITIVE_KEYS = re.compile(r'password|token|secret|api_?key|ssn|credit', re.I)

def redact(obj):
    # Recursively replace the value of any key that looks sensitive
    if isinstance(obj, dict):
        return {k: '[REDACTED]' if SENSITIVE_KEYS.search(k) else redact(v)
                for k, v in obj.items()}
    if isinstance(obj, list):
        return [redact(v) for v in obj]
    return obj
Applied automatically before logging.
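One way to make the redaction automatic is a `logging.Filter` attached to the logger, so every record is scrubbed before any handler sees it. The sketch below is self-contained and assumes, as an illustrative convention, that structured fields travel on the record as a `fields` dict; `RedactionFilter` is a hypothetical name.

```python
import logging
import re

SENSITIVE_KEYS = re.compile(r"password|token|secret|api_?key|ssn|credit", re.I)

def redact(obj):
    """Recursively replace values whose key looks sensitive."""
    if isinstance(obj, dict):
        return {k: "[REDACTED]" if SENSITIVE_KEYS.search(str(k)) else redact(v)
                for k, v in obj.items()}
    if isinstance(obj, list):
        return [redact(v) for v in obj]
    return obj

class RedactionFilter(logging.Filter):
    """Scrub structured fields on every record; never drops the record."""

    def filter(self, record: logging.LogRecord) -> bool:
        if isinstance(getattr(record, "fields", None), dict):
            record.fields = redact(record.fields)
        return True

logger = logging.getLogger("engine.api")
logger.addFilter(RedactionFilter())

sample = {"user": "alice", "password": "hunter2", "headers": [{"api_key": "k-123"}]}
print(redact(sample))
```

Because the pattern matches substrings, it also redacts keys such as `token_count` or `classname` (which contains `ssn`). Over-redaction is the safer failure mode, but note that a hyphenated key like `api-key` is not matched by `api_?key`.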
Log retention¶
| Log type | Hot retention | Cold retention |
|---|---|---|
| ERROR/FATAL | 30 days | 1 year |
| WARN | 14 days | 90 days |
| INFO sampled | 7 days | 30 days |
| DEBUG | Off in prod | - |
| Audit log | 30 days | 7 years (compliance) |
Storage backends:
- Hot: Loki/Elasticsearch/Splunk
- Cold: S3 with Athena queries
Distributed tracing¶
Span design¶
For each "operation" create a span:
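from opentelemetry import trace

tracer = trace.get_tracer(__name__)

# tenant_id, intent and bpmn_process_id come from the command being processed
with tracer.start_as_current_span("process_command") as span:
    span.set_attribute("workflow.tenant.id", tenant_id)
    span.set_attribute("workflow.command.intent", intent)
    span.set_attribute("workflow.process.id", bpmn_process_id)
    # Process...
    span.set_attribute("workflow.command.success", True)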
Span hierarchy¶
HTTP request (parent span)
├── auth.validate_token
├── api.handler.create_instance
│   ├── db.command_log.insert
│   └── notify.engine
│       └── engine.process_command (could be separate trace if async)
│           ├── db.read_state
│           ├── bpmn.parse (cached usually)
│           ├── bpmn.process
│           │   ├── behavior.variable_mapping
│           │   ├── behavior.job_creation
│           │   └── ...
│           └── db.write_state_atomic
Workers continue trace:
worker.activate_jobs
├── http.request to engine
└── http.response (with trace context propagated back)
worker.handler.send_email
├── (continuation of process_command trace via propagated trace context)
├── http.request to email provider
└── worker.complete_job
    └── http.request to engine
End-to-end visibility from user request → worker → completion.
Sampling¶
100% sampling = expensive. Strategies:
import random

PREMIUM_TENANTS = {"acme"}  # example value; load from tenant config

# Tail-based sampling: keep the whole trace if the request was slow or errored
def should_sample(trace):
    return (
        trace.duration_ms > 1000 or         # Slow
        trace.has_error or                  # Failed
        trace.tenant in PREMIUM_TENANTS or  # VIP
        random.random() < 0.01              # 1% baseline
    )
Most APMs handle this server-side.
Trace context propagation¶
W3C Trace Context standard:
GET /v2/jobs/12345 HTTP/1.1
traceparent: 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01
tracestate: vendor1=...
OTel SDK handles automatically.
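To make the header layout concrete, the sketch below splits a version `00` `traceparent` value into its four fields (version, trace-id, parent-id, trace-flags) and applies the validity rules from the W3C spec. It is for illustration only; in practice the OTel SDK propagators do this, and `parse_traceparent` is a hypothetical helper name.

```python
import re
from typing import NamedTuple, Optional

# version: 2 hex chars, trace-id: 32, parent-id: 16, trace-flags: 2
_TRACEPARENT = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")

class TraceParent(NamedTuple):
    version: str
    trace_id: str
    parent_id: str
    sampled: bool

def parse_traceparent(header: str) -> Optional[TraceParent]:
    """Return the parsed header, or None if it is malformed."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # Version ff and all-zero IDs are invalid per the spec
    if version == "ff" or set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None
    # Bit 0 of trace-flags is the "sampled" flag
    return TraceParent(version, trace_id, parent_id, bool(int(flags, 16) & 0x01))

ctx = parse_traceparent("00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01")
print(ctx.trace_id, ctx.sampled)
```

The trace-id stays constant across every hop of a request, while the parent-id changes at each hop to name the calling span.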
Span attributes conventions¶
Per OTel semantic conventions:
- HTTP: http.method, http.url, http.status_code
- DB: db.system, db.statement, db.user
- Workflow custom: workflow.tenant.id, workflow.process.id, etc.
Document workflow.* conventions explicitly for consistency.
Alerting design¶
Golden signals (SRE book)¶
Always alert on these:
- Latency: TP99 > threshold for N min
- Traffic: TPS unusual (anomaly)
- Errors: Error rate > threshold
- Saturation: Resource utilization high
For MVP engine:
# Latency (p99 computed from per-bucket rates, aggregated by le)
- alert: HighRequestLatency
  expr: histogram_quantile(0.99, sum by (le) (rate(http_server_duration_bucket[5m]))) > 1.0
  for: 5m
  labels:
    severity: warning
# Traffic anomaly
- alert: TrafficDropSudden
  expr: (rate(http_server_requests_total[5m])) < (rate(http_server_requests_total[1h] offset 1h)) * 0.5
  for: 10m
  labels:
    severity: warning
# Error rate (ratio of 5xx responses to all responses)
- alert: HighErrorRate
  expr: sum(rate(http_server_requests_total{status=~"5.."}[5m])) / sum(rate(http_server_requests_total[5m])) > 0.05
  for: 5m
  labels:
    severity: critical
# Saturation
- alert: EngineQueueSaturating
  expr: engine_commands_queue_depth_percent > 80
  for: 2m
  labels:
    severity: warning
- alert: EngineQueueCritical
  expr: engine_commands_queue_depth_percent > 95
  for: 1m
  labels:
    severity: critical
Business KPI alerts¶
- alert: ProcessInstanceCreationDropped
  expr: rate(workflow_process_instances_created_total[15m]) < 1
  for: 30m
  labels:
    severity: warning
  annotations:
    summary: "Process creation rate unusual"
- alert: IncidentRateHigh
  expr: rate(workflow_incidents_created_total[5m]) > 1
  for: 5m
  labels:
    severity: warning
Failure-mode alerts¶
Per analysis/failure-mode-analysis, specific alerts:
# F1: Engine crash
- alert: EngineDown
  expr: up{job="workflow-engine"} == 0
  for: 1m
  labels:
    severity: critical
# F2: Processing stuck
- alert: EngineProcessingStuck
  expr: rate(engine_commands_processed_total[5m]) == 0
  for: 5m
  labels:
    severity: critical
# F5: DB primary failed
- alert: PostgresPrimaryDown
  expr: pg_up{role="primary"} == 0
  for: 30s
  labels:
    severity: critical
# F6: Replication lag
- alert: PostgresReplicationLag
  expr: pg_replication_lag_bytes > 100e6
  for: 5m
  labels:
    severity: warning
# F7: Disk full
- alert: DiskFullCritical
  expr: (node_filesystem_avail_bytes / node_filesystem_size_bytes) < 0.1
  for: 5m
  labels:
    severity: critical
Alert routing¶
# Routes
routes:
  - match: { severity: critical }
    receiver: pagerduty
    continue: true
  - match: { severity: critical }
    receiver: slack-critical
  - match: { severity: warning }
    receiver: slack-warnings
  - match: { team: db }
    receiver: dba-team
Alert fatigue prevention¶
- Aggregation: same alert from multiple instances → one notification
- Inhibition: critical suppresses warning
- Time-based silences: maintenance windows
- De-duplication: state-based, not edge-based
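The aggregation and de-duplication rules can be illustrated with a small sketch that tracks the firing state per alert and notifies only when that state changes, rather than on every evaluation that finds the condition true. Alertmanager provides this behavior through grouping; the `StateBasedNotifier` class below is a hypothetical illustration, not part of any tool.

```python
from dataclasses import dataclass, field

@dataclass
class StateBasedNotifier:
    """Notify once when an alert starts firing and once when it resolves."""
    firing: set[str] = field(default_factory=set)
    sent: list[str] = field(default_factory=list)

    def evaluate(self, alert_name: str, instances_firing: int) -> None:
        is_firing = instances_firing > 0
        was_firing = alert_name in self.firing
        if is_firing and not was_firing:
            self.firing.add(alert_name)
            # All firing instances are folded into a single notification
            self.sent.append(f"FIRING {alert_name} ({instances_firing} instances)")
        elif not is_firing and was_firing:
            self.firing.discard(alert_name)
            self.sent.append(f"RESOLVED {alert_name}")
        # Still firing or still healthy: state unchanged, so nothing is sent

notifier = StateBasedNotifier()
for count in [0, 3, 3, 5, 0]:  # instances firing at each evaluation cycle
    notifier.evaluate("HighErrorRate", count)
print(notifier.sent)
```

Five evaluation cycles, three of them with the condition true on several instances, produce two notifications: one when the alert starts firing and one when it resolves.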
SLI/SLO definitions¶
SLI (Service Level Indicators)¶
What we measure:
| SLI | Definition |
|---|---|
| Availability | % of requests that succeeded |
| Latency | % of requests faster than 1s |
| Throughput | TPS sustained |
| Data integrity | % of replay determinism tests passed |
| Incident resolution | % of incidents resolved within 1 hour |
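The ratio-style SLIs in the table (availability, latency) share one shape: good events divided by total events over a window. A minimal sketch with made-up request counts follows; the function name and the choice to report 1.0 when there is no traffic are assumptions.

```python
def ratio_sli(good_events: int, total_events: int) -> float:
    """Fraction of events that met the criterion; 1.0 when there was no traffic."""
    return 1.0 if total_events == 0 else good_events / total_events

# Hypothetical request counts for one measurement window
total = 200_000
succeeded = 199_100          # requests that did not fail
under_one_second = 192_400   # requests that completed in < 1s

availability = ratio_sli(succeeded, total)
latency = ratio_sli(under_one_second, total)
print(f"availability={availability:.2%} latency={latency:.2%}")
```

Expressing each SLI as good/total keeps it directly comparable with its SLO target, which is also a percentage.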
SLO (Service Level Objectives)¶
What we commit:
| Phase | SLO |
|---|---|
| Phase 0 | 99% availability monthly |
| Phase 0 | 95% requests < 1s |
| Phase 0 | 100% no data loss |
| Phase 1 | 99.9% availability |
| Phase 1 | 99% requests < 1s |
| Phase 2 | 99.95% availability |
| Phase 2 | 99.5% requests < 500ms |
Error budget¶
SLO 99% = 1% unavailability allowed
1% of 30 days = 7.2 hours/month error budget
If burning: slow down feature releases, focus on reliability
If under: invest in features (some budget OK to spend)
# Error budget remaining (1 = untouched, 0 = exhausted, negative = SLO breached)
1 - (
  (sum(rate(http_server_requests_total{status=~"5.."}[30d])) / sum(rate(http_server_requests_total[30d])))
  / (1 - 0.99) # allowed error ratio for a 99% SLO
)
Track in Grafana, alert when error budget burning fast.
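The budget arithmetic can be checked with a few lines of Python. This is a sketch with illustrative function names; it treats the budget as time-based downtime for the hours figure and as an error ratio for the remaining fraction.

```python
def error_budget_hours(slo: float, window_days: int = 30) -> float:
    """Allowed downtime, in hours, for an availability SLO over the window."""
    return (1.0 - slo) * window_days * 24

def budget_remaining(slo: float, observed_availability: float) -> float:
    """Fraction of the budget left: 1.0 untouched, 0.0 spent, negative if overspent."""
    return 1.0 - (1.0 - observed_availability) / (1.0 - slo)

for slo in (0.99, 0.999, 0.9995):  # Phase 0, 1 and 2 availability targets
    print(f"SLO {slo:.2%}: {error_budget_hours(slo):.2f} h/month")

print(f"remaining at 99.5% observed: {budget_remaining(0.99, 0.995):.0%}")
```

Each additional nine cuts the budget by a factor of ten: 7.2 hours at 99% becomes about 43 minutes at 99.9%, which is why the later phases need faster detection and recovery.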
SLO burn rate alerts¶
# Burning error budget fast
- alert: ErrorBudgetBurnFast
  expr: |
    (
      slo_error_rate_5m > (14.4 * (1 - 0.99)) # 14.4x = burns 2% of the 30-day budget in 1h
    )
  for: 2m
  labels:
    severity: critical
- alert: ErrorBudgetBurnMedium
  expr: |
    (
      slo_error_rate_1h > (6 * (1 - 0.99)) # 6x = burns 5% of the 30-day budget in 6h
    )
  for: 15m
  labels:
    severity: warning
Multi-window, multi-burn-rate approach (Google SRE Workbook).
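The burn-rate multipliers follow from simple arithmetic: a burn rate is the observed error rate divided by the allowed error rate, and the share of budget consumed is the burn rate times the fraction of the budget period that has elapsed. The sketch below assumes a 30-day budget period; the function names are illustrative.

```python
def burn_rate(error_rate: float, slo: float) -> float:
    """How many times faster than 'exactly on budget' errors are arriving."""
    return error_rate / (1.0 - slo)

def budget_consumed(rate: float, window_hours: float, period_days: int = 30) -> float:
    """Fraction of the period's error budget spent if `rate` holds for the window."""
    return rate * window_hours / (period_days * 24)

print(f"{budget_consumed(14.4, 1):.1%}")  # 14.4x sustained for 1 hour
print(f"{budget_consumed(6, 6):.1%}")     # 6x sustained for 6 hours
print(f"{burn_rate(0.144, 0.99):.1f}x")   # 14.4% errors against a 99% SLO
```

At 14.4x the whole 30-day budget would be gone in roughly two days, which is why that threshold pages immediately, while 6x leaves about five days and only warns.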
Dashboards design¶
Dashboard 1: Platform Health (5-min refresh)¶
For SREs at-a-glance:
- 4 SLI panels (top row, big numbers)
- Engine TPS / Latency / Errors / Queue
- DB connections / Replication lag
- Resources (CPU, Memory, Disk)
Dashboard 2: Business KPIs¶
For ops/PM:
- Process instances created per hour (by process)
- Completion rate
- Incident rate (by type)
- User task creation/completion rates
- Average process duration
- Top 10 processes by volume
Dashboard 3: Per-tenant¶
For customer success / billing:
- Per-tenant TPS
- Per-tenant resource usage
- Per-tenant incident rate
- Quota usage
Dashboard 4: Worker Performance¶
For dev teams:
- Jobs per worker type
- Worker latency (p50, p99)
- Failure rate by job type
- Activation queue depth
Dashboard 5: Postgres¶
For DBA:
- Connections active/idle
- Slow queries (pg_stat_statements)
- Cache hit ratio
- Replication lag
- WAL stats
- Top tables by size/growth
Cost optimization¶
Sample logs aggressively¶
INFO logs at 100K/sec × 1KB = 100 MB/sec = 8.6 TB/day = $$$ at storage. Sample to 1% = 86 GB/day. Save ERROR/WARN always.
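The volume figures can be reproduced with a short calculation. This sketch uses decimal units (1 GB = 1e9 bytes) and, for the cost line, assumes the $0.50/GB price quoted elsewhere in this document and a 30-day month; the function name is illustrative.

```python
def daily_log_volume_gb(events_per_sec: float, bytes_per_event: int,
                        sample_rate: float = 1.0) -> float:
    """Ingested log volume per day in GB (decimal, 1 GB = 1e9 bytes)."""
    return events_per_sec * bytes_per_event * sample_rate * 86_400 / 1e9

full = daily_log_volume_gb(100_000, 1_000)           # unsampled: about 8.6 TB/day
sampled = daily_log_volume_gb(100_000, 1_000, 0.01)  # 1% sample: about 86 GB/day
monthly_cost = sampled * 30 * 0.50                   # assumes $0.50/GB, 30-day month
print(f"{full:,.0f} GB/day -> {sampled:.1f} GB/day, ~${monthly_cost:,.0f}/month")
```

Since volume scales linearly with the sample rate, the rate is the main lever on log cost once event size is fixed.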
Don't ingest debug logs to APM¶
DEBUG locally only. ALERT only on ERRORs.
Trace sampling¶
100% only for slow/error requests. Otherwise 1% baseline.
Metrics: avoid high-cardinality labels¶
process_instance_key label → millions of series → metrics DB explodes.
Retention tiered¶
- Hot (recent): high-priced fast storage
- Warm (1 week): cheaper
- Cold (30+ days): S3 Glacier
Estimated costs¶
For MVP at ~100 TPS:
| Component | Basis | Cost (monthly) |
|---|---|---|
| Metrics (Prometheus or APM) | 100K series × $0.001 per series | $100 |
| Logs (1% sampled) | 86 GB/day × 30 days × $0.50/GB | ~$1,300 |
| Traces (tail-sampled) | 1M spans/day × 30 days × $1 per million spans | $30 |
| Total APM tooling | | ~$1,500 |
At higher scale, costs scale roughly linearly.
Runbook automation¶
Alerts link to runbooks:
Runbook example:
# High Request Latency
## Detection
TP99 > 1s for 5 min
## Investigation
1. Check Grafana "Platform Health" dashboard
2. Look at top slow queries: `mvp-cli postgres slow-queries`
3. Check engine queue depth
4. Check DB replication lag
## Common causes + fixes
- DB query degradation → check pg_stat_statements
- Engine queue saturating → check why
- Worker overload → check job activation latency
## Escalation
After 30 min unresolved: page on-call engineer
Links¶
- adrs/adr-011-opentelemetry-instrumentation — Foundation
- adrs/adr-010-hybrid-monitoring-apm-inspector — APM strategy
- analysis/failure-mode-analysis — Failure → alert mapping
- analysis/security-threat-model — Audit log requirements
- Google SRE book
- The Four Golden Signals
- USE method
- RED method