
ADR-011: OpenTelemetry for instrumentation

  • Status: Accepted
  • Date: 2026-05-14
  • Tags: observability, instrumentation, standards

Context and Problem Statement

The engine needs to emit metrics, traces, and logs for observability. Should we use each APM vendor's own SDK (Datadog SDK, New Relic agent, etc.) or a vendor-agnostic standard?

Decision Drivers

  • Vendor lock-in is a problem for platform engines
  • OpenTelemetry is a widely adopted CNCF standard
  • Major APM vendors (Dynatrace, Datadog, New Relic, Grafana) all support OTLP ingestion
  • Switching vendors should be an env-var change, not a code rewrite

Considered Options

  1. OpenTelemetry (OTLP): CNCF standard
  2. Prometheus + native client libs: metrics-only standard
  3. Vendor-specific SDK (Datadog, New Relic)
  4. Custom protocol: proprietary

Decision Outcome

Chosen option: OpenTelemetry, because:

  • It is a CNCF standard adopted industry-wide
  • It is vendor-agnostic via OTLP export
  • It unifies metrics, traces, and logs in a single SDK
  • It has active development and an active community
  • It is future-proof: new vendors can be expected to support OTLP

Positive Consequences

  • No vendor lock-in
  • Switching APM = env vars (no code changes)
  • Unified instrumentation (metrics + traces + logs)
  • Automatic distributed tracing
  • Standard semantic conventions
  • Future vendors are supported automatically, provided they ingest OTLP
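
Distributed tracing relies on each service passing trace context on to the next. By default the OpenTelemetry SDKs do this with the W3C Trace Context `traceparent` header. The following is a minimal standard-library sketch of that header format, meant only to illustrate what the SDK propagates for us. The helper names `parse_traceparent` and `child_traceparent` are hypothetical and are not part of the OTel API.

```python
import re
import secrets

# W3C Trace Context, version 00: version-traceid-parentid-flags, lowercase hex.
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def parse_traceparent(header: str) -> tuple[str, str, bool]:
    """Return (trace_id, parent_span_id, sampled) or raise ValueError."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        raise ValueError(f"malformed traceparent: {header!r}")
    trace_id, parent_id, flags = match.groups()
    # All-zero IDs are explicitly invalid in the specification.
    if set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        raise ValueError("all-zero trace or span id")
    # Bit 0 of the flags byte is the "sampled" flag.
    return trace_id, parent_id, bool(int(flags, 16) & 0x01)


def child_traceparent(header: str) -> str:
    """Build the header an outgoing call would carry: same trace, new span id."""
    trace_id, _parent_id, sampled = parse_traceparent(header)
    new_span_id = secrets.token_hex(8)  # 8 random bytes -> 16 hex characters
    return f"00-{trace_id}-{new_span_id}-{'01' if sampled else '00'}"
```

In the engine itself none of this is hand-written: the SDK's propagators read and write the header. The sketch only shows why spans emitted by different services end up in the same trace.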

Negative Consequences

  • The OTel SDK adds a binary dependency
  • Learning curve for new developers
  • Auto-instrumentation is still evolving for some languages
  • The tag/label convention needs documentation

Implementation

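# Setup
from opentelemetry import metrics, trace
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# Register the providers once at startup. The exporters take their endpoint
# and headers from the OTEL_EXPORTER_OTLP_* environment variables.
tracer_provider = TracerProvider()
tracer_provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter()))
trace.set_tracer_provider(tracer_provider)

metrics.set_meter_provider(
    MeterProvider(metric_readers=[PeriodicExportingMetricReader(OTLPMetricExporter())])
)

# Metrics
meter = metrics.get_meter("workflow-engine")
process_instances_created = meter.create_counter(
    "workflow.process_instances.created",
    unit="1",
    description="Total process instances created",
)

# Traces
tracer = trace.get_tracer("workflow-engine")

# In engine code (handle_command is an illustrative wrapper;
# process_command is the engine's existing async command handler)
async def handle_command(command, intent, tenant_id, bpmn_process_id):
    with tracer.start_as_current_span("process_command") as span:
        span.set_attribute("command.intent", intent)
        span.set_attribute("tenant.id", tenant_id)
        span.set_attribute("process.definition.id", bpmn_process_id)

        result = await process_command(command)

        process_instances_created.add(1, {
            "tenant": tenant_id,
            "process": bpmn_process_id,
        })
        return result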

Configuration via env vars

Swapping vendors only requires changing environment variables:

# Dynatrace
OTEL_EXPORTER_OTLP_ENDPOINT=https://yourtenant.live.dynatrace.com/api/v2/otlp
OTEL_EXPORTER_OTLP_HEADERS="Authorization=Api-Token <token>"
OTEL_SERVICE_NAME=workflow-engine
OTEL_RESOURCE_ATTRIBUTES=service.version=1.0.0,deployment.environment=production

# Or Datadog
OTEL_EXPORTER_OTLP_ENDPOINT=http://datadog-agent:4317

# Or Grafana Cloud
OTEL_EXPORTER_OTLP_ENDPOINT=https://otlp-gateway-prod.grafana.net/otlp
OTEL_EXPORTER_OTLP_HEADERS="Authorization=Basic <base64>"
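
Since all vendor-specific configuration lives in these variables, a startup check that parses them can catch typos before any telemetry is lost. The sketch below uses only the standard library and assumes the usual `key=value,key=value` format for `OTEL_EXPORTER_OTLP_HEADERS` and `OTEL_RESOURCE_ATTRIBUTES`. It also shows how OTLP over HTTP appends `/v1/<signal>` to the base endpoint. The function names and the `unknown_service` fallback are illustrative assumptions, not engine code.

```python
import os
from urllib.parse import unquote


def parse_kv_list(raw: str) -> dict[str, str]:
    """Parse 'k1=v1,k2=v2'; values may be percent-encoded."""
    pairs = {}
    for item in raw.split(","):
        if not item.strip():
            continue
        # Split on the first '=' only, so base64 padding in values survives.
        key, sep, value = item.partition("=")
        if not sep or not key.strip():
            raise ValueError(f"expected key=value, got {item!r}")
        pairs[key.strip()] = unquote(value.strip())
    return pairs


def http_signal_endpoints(base: str) -> dict[str, str]:
    """For OTLP/HTTP, each signal is posted to <base>/v1/<signal>."""
    base = base.rstrip("/")
    return {s: f"{base}/v1/{s}" for s in ("traces", "metrics", "logs")}


def describe_otel_env(env=os.environ) -> dict:
    """Summarise the OTEL_* settings without echoing secret header values."""
    headers = parse_kv_list(env.get("OTEL_EXPORTER_OTLP_HEADERS", ""))
    endpoint = env.get("OTEL_EXPORTER_OTLP_ENDPOINT", "")
    return {
        "service_name": env.get("OTEL_SERVICE_NAME", "unknown_service"),
        "resource": parse_kv_list(env.get("OTEL_RESOURCE_ATTRIBUTES", "")),
        "header_names": sorted(headers),
        "http_endpoints": http_signal_endpoints(endpoint) if endpoint else {},
    }
```

Logging the result of `describe_otel_env()` at startup would show which backend the engine is pointed at without leaking the API token.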

Semantic conventions

Follow the OTel semantic conventions wherever possible:

# HTTP semantics
span.set_attribute("http.request.method", "POST")
span.set_attribute("http.response.status_code", 200)
span.set_attribute("http.route", "/v2/process-instances")

# Database semantics
span.set_attribute("db.system.name", "postgresql")
span.set_attribute("db.query.text", sql_query)

# Custom workflow semantics
span.set_attribute("workflow.tenant.id", tenant_id)
span.set_attribute("workflow.process.id", bpmn_process_id)
span.set_attribute("workflow.process_instance.key", pid)
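
One way to keep the custom attributes consistent is to build them in a single helper and check new names against the convention in a unit test. The sketch below assumes the `workflow.` prefix shown above is the agreed namespace and that names are lowercase, dot-separated, with snake_case segments. `workflow_attributes` and `check_custom_attribute` are hypothetical helpers, not existing engine functions.

```python
import re

# Lowercase, dot-separated namespaces; snake_case within each segment.
_ATTRIBUTE_NAME = re.compile(r"^[a-z][a-z0-9_]*(\.[a-z][a-z0-9_]*)+$")
CUSTOM_NAMESPACE = "workflow."


def workflow_attributes(tenant_id: str, bpmn_process_id: str, process_instance_key: int) -> dict:
    """Return the custom attributes under their agreed names."""
    return {
        "workflow.tenant.id": tenant_id,
        "workflow.process.id": bpmn_process_id,
        "workflow.process_instance.key": process_instance_key,
    }


def check_custom_attribute(name: str) -> None:
    """Raise ValueError if a custom attribute name breaks the convention."""
    if not name.startswith(CUSTOM_NAMESPACE):
        raise ValueError(f"{name!r} is outside the {CUSTOM_NAMESPACE!r} namespace")
    if not _ATTRIBUTE_NAME.match(name):
        raise ValueError(f"{name!r} is not lowercase dot-separated snake_case")
```

At a call site this could be used as `span.set_attributes(workflow_attributes(tenant_id, bpmn_process_id, pid))`, so the attribute names are spelled in exactly one place.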