ADR-011: OpenTelemetry for instrumentation
- Status: Accepted
- Date: 2026-05-14
- Tags: observability, instrumentation, standards
Context and Problem Statement
The engine needs to emit metrics, traces, and logs for observability. Should we use each APM vendor's own SDK (Datadog SDK, New Relic agent, etc.) or a vendor-agnostic standard?
Decision Drivers
- Vendor lock-in is a problem for platform engines
- OpenTelemetry is a CNCF-hosted standard with broad industry adoption
- Major APM vendors (Dynatrace, Datadog, New Relic, Grafana) all support OTLP ingestion
- Switching vendors should be an env-var change, not a code rewrite
Considered Options
- OpenTelemetry (OTLP): CNCF standard
- Prometheus + native client libs: metrics-only standard
- Vendor-specific SDK (Datadog, New Relic)
- Custom protocol: proprietary
Decision Outcome
Chosen option: OpenTelemetry, because:
- CNCF standard adopted industry-wide
- Vendor-agnostic via OTLP export
- Unified metrics + traces + logs (single SDK)
- Active development and community
- Future-proof: new vendors typically ship with OTLP support
Positive Consequences
- No vendor lock-in
- Switching APM = env vars (no code changes)
- Unified instrumentation (metrics + traces + logs)
- Automatic distributed tracing
- Standard semantic conventions
- Future vendors are supported automatically, as long as they accept OTLP
Negative Consequences
- The OTel SDK adds a binary dependency
- Learning curve for new developers
- Auto-instrumentation is still evolving for some languages
- The tags/labels convention requires documentation
Implementation
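# Setup
from opentelemetry import metrics, trace
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# Register the SDK providers. The exporters read OTEL_EXPORTER_OTLP_* from the
# environment. These are the gRPC exporters; HTTP-only OTLP endpoints need the
# opentelemetry.exporter.otlp.proto.http equivalents.
tracer_provider = TracerProvider()
tracer_provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter()))
trace.set_tracer_provider(tracer_provider)
metrics.set_meter_provider(
    MeterProvider(metric_readers=[PeriodicExportingMetricReader(OTLPMetricExporter())])
)

# Metrics
meter = metrics.get_meter("workflow-engine")
process_instances_created = meter.create_counter(
    "workflow.process_instances.created",
    unit="1",
    description="Total process instances created",
)

# Traces
tracer = trace.get_tracer("workflow-engine")

# In engine code. handle_command is an illustrative wrapper so that `await` is
# valid; process_command is the engine's own command handler.
async def handle_command(command, intent, tenant_id, bpmn_process_id):
    with tracer.start_as_current_span("process_command") as span:
        span.set_attribute("command.intent", intent)
        span.set_attribute("tenant.id", tenant_id)
        span.set_attribute("process.definition.id", bpmn_process_id)
        result = await process_command(command)
        process_instances_created.add(1, {
            "tenant": tenant_id,
            "process": bpmn_process_id,
        })
    return result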
Configuration via env vars
Switching vendors only requires changing environment variables:
# Dynatrace (OTLP over HTTP; the SDK appends /v1/traces, /v1/metrics, /v1/logs to this base URL)
OTEL_EXPORTER_OTLP_ENDPOINT=https://yourtenant.live.dynatrace.com/api/v2/otlp
OTEL_EXPORTER_OTLP_HEADERS="Authorization=Api-Token <token>"
OTEL_SERVICE_NAME=workflow-engine
OTEL_RESOURCE_ATTRIBUTES=service.version=1.0.0,deployment.environment=production
# Or Datadog
OTEL_EXPORTER_OTLP_ENDPOINT=http://datadog-agent:4317
# Or Grafana Cloud (OTLP over HTTP)
OTEL_EXPORTER_OTLP_ENDPOINT=https://otlp-gateway-prod.grafana.net/otlp
OTEL_EXPORTER_OTLP_HEADERS="Authorization=Basic <base64>"
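Because a vendor switch happens entirely in the environment, a typo in these variables only surfaces at export time. A minimal sketch of a startup pre-flight check follows, using only the standard library. The helper names `parse_otel_pairs` and `check_otel_env` are hypothetical and not part of the OTel SDK, which does its own parsing; the sketch only assumes the comma-separated `key=value` format shown above.

```python
import os
from typing import Dict, List, Mapping


def parse_otel_pairs(raw: str) -> Dict[str, str]:
    """Parse a comma-separated list of key=value pairs into a dict."""
    pairs: Dict[str, str] = {}
    for item in raw.split(","):
        item = item.strip()
        if not item:
            continue
        # Split on the first "=" only, so values such as base64 padding survive.
        key, sep, value = item.partition("=")
        if not sep or not key.strip():
            raise ValueError(f"malformed key=value pair: {item!r}")
        pairs[key.strip()] = value.strip()
    return pairs


def check_otel_env(env: Mapping[str, str]) -> List[str]:
    """Return human-readable problems; an empty list means nothing was flagged."""
    problems: List[str] = []
    endpoint = env.get("OTEL_EXPORTER_OTLP_ENDPOINT", "")
    if not endpoint:
        problems.append("OTEL_EXPORTER_OTLP_ENDPOINT is not set (the SDK falls back to localhost)")
    elif not endpoint.startswith(("http://", "https://")):
        problems.append("OTEL_EXPORTER_OTLP_ENDPOINT should be an http(s) URL")
    if not env.get("OTEL_SERVICE_NAME"):
        problems.append("OTEL_SERVICE_NAME is not set")
    for name in ("OTEL_EXPORTER_OTLP_HEADERS", "OTEL_RESOURCE_ATTRIBUTES"):
        try:
            parse_otel_pairs(env.get(name, ""))
        except ValueError as exc:
            problems.append(f"{name}: {exc}")
    return problems


if __name__ == "__main__":
    # Print warnings for the current process environment.
    for problem in check_otel_env(os.environ):
        print(f"warning: {problem}")
```

Running the check once at engine startup, before the providers are registered, would turn a misconfigured vendor switch into a log line rather than silently dropped telemetry.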
Semantic conventions
Follow the OTel semantic conventions whenever possible:
# HTTP semantics (stable names; older releases used http.method and http.status_code)
span.set_attribute("http.request.method", "POST")
span.set_attribute("http.response.status_code", 200)
span.set_attribute("http.route", "/v2/process-instances")
# Database semantics (stable names; older releases used db.system and db.statement)
span.set_attribute("db.system.name", "postgresql")
span.set_attribute("db.query.text", sql_query)
# Custom workflow semantics
span.set_attribute("workflow.tenant.id", tenant_id)
span.set_attribute("workflow.process.id", bpmn_process_id)
span.set_attribute("workflow.process_instance.key", pid)
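One way to keep the custom `workflow.*` keys consistent across the codebase is to build them in a single place instead of repeating string literals at every call site. The sketch below is only an illustration: the helper `workflow_attributes` and its parameter names are hypothetical, and it assumes the three keys listed above are the agreed convention.

```python
from typing import Dict, Optional, Union

AttributeValue = Union[str, int]


def workflow_attributes(
    tenant_id: str,
    process_id: str,
    process_instance_key: Optional[AttributeValue] = None,
) -> Dict[str, AttributeValue]:
    """Build the custom workflow.* span attributes from one shared definition."""
    attributes: Dict[str, AttributeValue] = {
        "workflow.tenant.id": tenant_id,
        "workflow.process.id": process_id,
    }
    # Omit the key entirely when there is no instance yet, since None is not
    # a valid attribute value.
    if process_instance_key is not None:
        attributes["workflow.process_instance.key"] = process_instance_key
    return attributes
```

A call site could then pass the result to the span in one step, for example `span.set_attributes(workflow_attributes(tenant_id, bpmn_process_id, pid))`, so renaming a key later touches one function.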
Links
- adrs/adr-010-hybrid-monitoring-apm-inspector: uses this ADR as its foundation
- OpenTelemetry
- OTLP specification
- Semantic conventions