Kubernetes Operator: design and CRDs

Kubernetes operator for deploying, managing the lifecycle of, and scaling the workflow engine. It provides declarative CRDs (Cluster, Tenant, Process), reconciliation loops, integration with Patroni-managed Postgres, and workload autoscaling.

Operator vs. Helm chart?

| Approach | Pros | Cons | When to use |
| --- | --- | --- | --- |
| Helm chart only | Simple, declarative templates | Not reactive; no lifecycle management; manual upgrades | M1 |
| Helm + Operator | Best of both worlds: Helm for install, operator for day-2 | More components to maintain | M2+ |
| Operator-only | Coherent; end-to-end lifecycle management | Steeper learning curve for users | M3+ |

Decision: Helm chart only in M1; add the operator in M2 for day-2 operations.

Proposed CRDs

WorkflowCluster: the engine cluster

apiVersion: workflow.example.com/v1alpha1
kind: WorkflowCluster
metadata:
  name: prod-east
  namespace: workflow-system
spec:
  version: "1.2.3"
  replicas: 3
  resources:
    requests:
      cpu: 2
      memory: 4Gi
    limits:
      cpu: 4
      memory: 8Gi
  postgres:
    # Option A: reference to an existing Patroni cluster
    externalRef:
      host: postgres-primary.workflow.svc
      port: 5432
      databaseSecret: wf-db-credentials
    # Option B: have the operator create the Patroni cluster
    # managed:
    #   replicas: 3
    #   storageClassName: fast-ssd
    #   storageSize: 100Gi
  network:
    ingress:
      enabled: true
      host: wf.example.com
      tlsSecret: wf-tls
  observability:
    metrics:
      enabled: true
      serviceMonitor: true  # creates a ServiceMonitor for Prometheus
    tracing:
      enabled: true
      otlpEndpoint: otel-collector.observability.svc:4317
  authentication:
    oidc:
      issuerURL: https://auth.example.com
      clientIDSecretRef: oidc-secret
status:
  phase: Running
  conditions:
    - type: Available
      status: "True"
      lastTransitionTime: "2026-05-14T10:00:00Z"
  readyReplicas: 3
  postgresStatus: Healthy
  version: "1.2.3"

WorkflowTenant: tenant configuration

apiVersion: workflow.example.com/v1alpha1
kind: WorkflowTenant
metadata:
  name: acme
spec:
  clusterRef:
    name: prod-east
  displayName: "Acme Corporation"
  quotas:
    maxActiveInstances: 10000
    maxJobsPerMinute: 1000
    maxStorageGB: 50
  isolation:
    rls: true  # Postgres Row-Level Security
status:
  phase: Active
  currentInstances: 4231
  currentStorageGB: 12.3
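
The quotas above only matter if something compares them with current usage. The following is a minimal sketch of such a check, assuming the controller already knows the tenant's usage; `TenantQuotas`, `TenantUsage` and `QuotaReached` are hypothetical names and not part of the proposed API. Rate limits such as maxJobsPerMinute need a time window and are left out.

```go
package main

import "fmt"

// TenantQuotas mirrors spec.quotas of the WorkflowTenant example (hypothetical type).
type TenantQuotas struct {
	MaxActiveInstances int64
	MaxJobsPerMinute   int64 // not evaluated here: needs a sliding time window
	MaxStorageGB       float64
}

// TenantUsage mirrors the usage fields reported under status (hypothetical type).
type TenantUsage struct {
	CurrentInstances int64
	CurrentStorageGB float64
}

// QuotaReached reports whether any limit has been reached and, if so,
// a human-readable reason for the first one found.
func QuotaReached(q TenantQuotas, u TenantUsage) (bool, string) {
	if u.CurrentInstances >= q.MaxActiveInstances {
		return true, fmt.Sprintf("active instances %d reached limit %d",
			u.CurrentInstances, q.MaxActiveInstances)
	}
	if u.CurrentStorageGB >= q.MaxStorageGB {
		return true, fmt.Sprintf("storage %.1f GB reached limit %.1f GB",
			u.CurrentStorageGB, q.MaxStorageGB)
	}
	return false, ""
}
```

A process controller could call this before deploying and surface the returned reason in the resource status.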

WorkflowProcess: process definitions via GitOps

apiVersion: workflow.example.com/v1alpha1
kind: WorkflowProcess
metadata:
  name: order-flow
  namespace: acme-tenant
spec:
  clusterRef:
    name: prod-east
  tenantRef:
    name: acme
  source:
    # Option A: inline BPMN
    bpmn: |
      <?xml version="1.0" encoding="UTF-8"?>
      <bpmn:definitions ...>
        ...
      </bpmn:definitions>
    # Option B: ConfigMap reference
    # configMapRef:
    #   name: order-flow-bpmn
    #   key: process.bpmn
    # Option C: Git source
    # gitRef:
    #   url: https://github.com/acme/processes.git
    #   path: processes/order-flow.bpmn
    #   revision: main
  versionTag: "v1.2.3"
  validate: strict  # fails if the BPMN is invalid
status:
  phase: Deployed
  processDefinitionKey: 2251799813685250
  version: 3
  deployedAt: "2026-05-14T10:30:00Z"

This makes GitOps natural: edit the repo, ArgoCD/Flux applies the change, and the operator deploys it to the engine.
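
The three spec.source options read as alternatives, so the controller would need to know which one is in use before rendering the BPMN. A minimal sketch of that check follows, assuming the options are mutually exclusive (the manifest implies this but does not state it); the Go type and function names are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical Go mirrors of the spec.source fields shown in the manifest.
type ConfigMapRef struct{ Name, Key string }
type GitRef struct{ URL, Path, Revision string }

type ProcessSource struct {
	BPMN         string
	ConfigMapRef *ConfigMapRef
	GitRef       *GitRef
}

// ErrNoSource is returned when none of the options is set.
var ErrNoSource = errors.New("spec.source: one of bpmn, configMapRef or gitRef is required")

// SourceKind returns the name of the single option that is set,
// or an error when none or more than one is set.
func SourceKind(s ProcessSource) (string, error) {
	var kinds []string
	if s.BPMN != "" {
		kinds = append(kinds, "bpmn")
	}
	if s.ConfigMapRef != nil {
		kinds = append(kinds, "configMapRef")
	}
	if s.GitRef != nil {
		kinds = append(kinds, "gitRef")
	}
	switch len(kinds) {
	case 0:
		return "", ErrNoSource
	case 1:
		return kinds[0], nil
	default:
		return "", fmt.Errorf("spec.source: options are mutually exclusive, got %v", kinds)
	}
}
```

The same rule could also be expressed as CRD schema validation, which would reject an invalid manifest at admission time instead of during reconciliation.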

WorkflowWorker (optional): managing workers

apiVersion: workflow.example.com/v1alpha1
kind: WorkflowWorker
metadata:
  name: payment-worker
spec:
  clusterRef:
    name: prod-east
  image: acme/payment-worker:1.2.3
  jobTypes: [charge-payment, refund-payment]
  replicas: 3
  autoscaling:
    minReplicas: 2
    maxReplicas: 20
    targetJobLagThreshold: 100  # custom metric
  resources:
    requests: { cpu: 500m, memory: 256Mi }
    limits:   { cpu: 2, memory: 1Gi }
  authentication:
    serviceAccount: worker-sa

The operator creates a Deployment plus an HPA driven by a custom metric (job lag).

Reconciliation loops

WorkflowCluster controller

Watch: WorkflowCluster
Compare desired vs actual:
  - replicas → adjust StatefulSet replica count
  - version → rolling upgrade (canary 1 pod, validate, continue)
  - resources → update pod spec, rollout
  - postgres external → ensure secret exists + reachable
  - ingress → create/update Ingress resource
Update status:
  - phase, readyReplicas, conditions
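
// Pseudo-Go using controller-runtime
import (
    "context"
    "time"

    ctrl "sigs.k8s.io/controller-runtime"
    "sigs.k8s.io/controller-runtime/pkg/client"
    "sigs.k8s.io/controller-runtime/pkg/log"

    workflowv1 "example.com/workflow-operator/api/v1alpha1" // hypothetical module path
)

func (r *WorkflowClusterReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
    var cluster workflowv1.WorkflowCluster
    if err := r.Get(ctx, req.NamespacedName, &cluster); err != nil {
        // The resource was deleted: nothing to reconcile.
        return ctrl.Result{}, client.IgnoreNotFound(err)
    }

    // 1. Validate Postgres connectivity
    if err := r.checkPostgresReachable(ctx, &cluster); err != nil {
        return r.updateCondition(ctx, &cluster, "PostgresUnreachable", err)
    }

    // 2. Run migrations if version changed
    if cluster.Status.Version != cluster.Spec.Version {
        if err := r.runMigrations(ctx, &cluster); err != nil {
            // controller-runtime ignores Result when err != nil and requeues with
            // its own backoff, so log the failure and requeue explicitly instead.
            log.FromContext(ctx).Error(err, "migrations failed; retrying in 60s")
            return ctrl.Result{RequeueAfter: 60 * time.Second}, nil
        }
    }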

    // 3. Ensure StatefulSet
    if err := r.ensureStatefulSet(ctx, &cluster); err != nil {
        return ctrl.Result{}, err
    }

    // 4. Ensure Service / Ingress
    if err := r.ensureNetwork(ctx, &cluster); err != nil {
        return ctrl.Result{}, err
    }

    // 5. Ensure ServiceMonitor / OTel config
    if err := r.ensureObservability(ctx, &cluster); err != nil {
        return ctrl.Result{}, err
    }

    // 6. Update status
    return r.updateStatus(ctx, &cluster)
}
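
The `updateCondition` and `updateStatus` helpers above are not defined in this document. As an illustration of the condition bookkeeping they would need, here is a minimal stdlib-only sketch in which `lastTransitionTime` changes only when the condition's status flips. The `Condition` type and `SetCondition` function are hypothetical; a real controller would more likely use `metav1.Condition` together with `meta.SetStatusCondition` from k8s.io/apimachinery.

```go
package main

import "time"

// Condition is a simplified stand-in for an entry in status.conditions.
type Condition struct {
	Type               string
	Status             string // "True", "False" or "Unknown"
	Reason             string
	Message            string
	LastTransitionTime time.Time
}

// SetCondition inserts cond or replaces the existing entry of the same Type.
// LastTransitionTime is set to now only when Status actually changes;
// otherwise the previous timestamp is preserved.
func SetCondition(conds []Condition, cond Condition, now time.Time) []Condition {
	for i := range conds {
		if conds[i].Type != cond.Type {
			continue
		}
		if conds[i].Status != cond.Status {
			cond.LastTransitionTime = now
		} else {
			cond.LastTransitionTime = conds[i].LastTransitionTime
		}
		conds[i] = cond
		return conds
	}
	cond.LastTransitionTime = now
	return append(conds, cond)
}
```

Preserving the timestamp across repeated failures keeps "how long has this been unavailable" answerable from the status alone.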

WorkflowProcess controller

Watch: WorkflowProcess + ConfigMap dependencies
On change:
  1. Render BPMN (inline / ConfigMap / Git fetch)
  2. Validate BPMN against version-tagged schema
  3. POST /api/v1/processes/deploy to engine
  4. If success: update status with key, version
  5. If failure: update status with error + retry

Handles common errors:

  • Invalid BPMN → status Failed with the reason.
  • Engine unreachable → retry with exponential backoff.
  • Tenant quota exceeded → status Blocked.
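
For the "engine unreachable" case, the retry delay can be derived from the attempt count. The following is a minimal sketch of capped exponential backoff; the function name and the base and cap values are assumptions, not decisions made in this design. Note that controller-runtime already applies a per-item exponential backoff when Reconcile returns an error, so an explicit calculation like this is only needed when the controller wants to set RequeueAfter itself.

```go
package main

import "time"

// Backoff returns the delay before retry number attempt (0-based):
// base * 2^attempt, capped at max.
func Backoff(attempt int, base, max time.Duration) time.Duration {
	if attempt < 0 {
		attempt = 0
	}
	d := base
	for i := 0; i < attempt; i++ {
		d *= 2
		if d >= max {
			// Stop doubling early so the value can never overflow.
			return max
		}
	}
	if d > max {
		return max
	}
	return d
}
```

With a 1s base and a 60s cap, attempts 0 to 5 wait 1s, 2s, 4s, 8s, 16s and 32s, and every later attempt waits 60s.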

WorkflowWorker controller

Reconciliation similar to Deployment + HPA:
  - Creates a Deployment with job_types passed as an env var
  - Creates an HPA on the "job_lag" metric from Prometheus
  - Watches for metrics availability

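HPA generated by the operator. The External metric must be served by a metrics adapter that implements the external metrics API (for example prometheus-adapter):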

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: payment-worker
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: payment-worker
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: External
      external:
        metric:
          name: wf_job_lag_seconds
          selector:
            matchLabels:
              job_type: charge-payment
        target:
          type: AverageValue
          averageValue: "30"
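
To make the effect of `averageValue` concrete, the sketch below approximates how an HPA turns an External metric with an AverageValue target into a replica count: the metric total is divided by the per-pod target, rounded up, and clamped to the min/max bounds. It is a simplification that ignores the HPA's tolerance band, stabilization windows and pod readiness; the function name is hypothetical.

```go
package main

import "math"

// DesiredReplicas approximates the HPA result for an External metric with
// an AverageValue target: ceil(metricTotal / targetAverage), clamped to
// [minReplicas, maxReplicas].
func DesiredReplicas(metricTotal, targetAverage float64, minReplicas, maxReplicas int) int {
	if targetAverage <= 0 {
		// An invalid target gives no scaling signal; stay at the floor.
		return minReplicas
	}
	n := int(math.Ceil(metricTotal / targetAverage))
	if n < minReplicas {
		return minReplicas
	}
	if n > maxReplicas {
		return maxReplicas
	}
	return n
}
```

With the manifest's values (target 30, bounds 2 to 20), a metric total of 300 yields 10 replicas, a total of 10 stays at the floor of 2, and a total of 900 is capped at 20.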

Upgrade strategy

Engine upgrade

  1. The reconciler detects that spec.version changed.
  2. It checks compatibility (no skipping MAJOR versions).
  3. It applies DB migrations (idempotent).
  4. Canary: 1 pod on the new version.
  5. Health check for 60s; roll back on failure.
  6. Progressive rolling update: 1 → 2 → 3.
  7. It updates status.version.

func (r *WorkflowClusterReconciler) upgradeStrategy(c *workflowv1.WorkflowCluster) error {
    if isBreaking(c.Status.Version, c.Spec.Version) {
        return errors.New("breaking version change requires manual approval")
    }
    // Canary
    if !c.Status.CanaryHealthy {
        return r.canaryUpgrade(c)
    }
    return r.rollingUpgrade(c)
}
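
The `isBreaking` helper used above is not defined in this document. One possible reading of the "no skipping MAJOR versions" rule is sketched below: an upgrade is treated as breaking when the target MAJOR is more than one ahead of the current one, or lower than it. That interpretation, the exported name `IsBreaking`, and the error-returning signature (the pseudo-code above uses a plain bool) are all assumptions.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// majorOf extracts the MAJOR component from a "MAJOR.MINOR.PATCH" string,
// tolerating an optional leading "v".
func majorOf(version string) (int, error) {
	v := strings.TrimPrefix(version, "v")
	head, _, _ := strings.Cut(v, ".")
	major, err := strconv.Atoi(head)
	if err != nil {
		return 0, fmt.Errorf("invalid version %q: %w", version, err)
	}
	return major, nil
}

// IsBreaking reports whether moving from current to target skips a MAJOR
// version or moves back to an older MAJOR. An empty current version is
// treated as a fresh install and is never breaking.
func IsBreaking(current, target string) (bool, error) {
	if current == "" {
		return false, nil
	}
	cur, err := majorOf(current)
	if err != nil {
		return false, err
	}
	tgt, err := majorOf(target)
	if err != nil {
		return false, err
	}
	return tgt > cur+1 || tgt < cur, nil
}
```

Under this reading, 1.2.3 → 2.0.0 is allowed, while 1.2.3 → 3.0.0 and 2.0.0 → 1.9.0 require manual approval.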

Annotation override to skip the canary:

metadata:
  annotations:
    workflow.example.com/skip-canary: "true"  # emergency

Pre-upgrade backup (optional)

Annotation that triggers a Postgres backup before the upgrade:

metadata:
  annotations:
    workflow.example.com/backup-before-upgrade: "true"

Postgres integration (Patroni)

The operator does NOT reimplement Patroni. Options:

  1. External: spec.postgres.externalRef points to an existing Patroni cluster.
  2. Managed: spec.postgres.managedRef delegates to the CloudNativePG or Crunchy operator (preferred).

spec:
  postgres:
    managedRef:
      apiGroup: postgresql.cnpg.io
      kind: Cluster
      name: wf-postgres

The operator waits until the Cluster is Ready, reads the connection string from the secret that CNPG creates, and injects it into the engine pods.
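
The "read the connection string from the secret" step could look like the sketch below. It assumes the secret data has already been fetched as a `map[string][]byte` and that it carries a `uri` key or, failing that, `host`, `port`, `dbname`, `username` and `password` keys. Those key names follow what recent CloudNativePG versions are understood to publish in the application secret and should be verified against the CNPG version in use; the function name is hypothetical.

```go
package main

import (
	"fmt"
	"net/url"
)

// ConnectionURI returns a Postgres connection URI from secret data.
// It prefers a ready-made "uri" key and otherwise assembles one from
// the individual fields (key names are assumptions, see the text above).
func ConnectionURI(data map[string][]byte) (string, error) {
	if uri := string(data["uri"]); uri != "" {
		return uri, nil
	}
	for _, key := range []string{"host", "username", "password", "dbname"} {
		if len(data[key]) == 0 {
			return "", fmt.Errorf("secret is missing key %q", key)
		}
	}
	port := string(data["port"])
	if port == "" {
		port = "5432" // Postgres default port
	}
	u := url.URL{
		Scheme: "postgresql",
		// UserPassword percent-encodes special characters in the credentials.
		User: url.UserPassword(string(data["username"]), string(data["password"])),
		Host: string(data["host"]) + ":" + port,
		Path: "/" + string(data["dbname"]),
	}
	return u.String(), nil
}
```

Injecting the secret by reference (for example through an env var with secretKeyRef) avoids copying credentials into the pod spec; a function like this is only useful where the operator itself needs to connect, such as the reachability check.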

Operator observability

# Operator metrics
operator_reconciliations_total{kind, outcome}
operator_reconciliation_duration_seconds{kind}
operator_resource_status{kind, namespace, name, status}

# Events
kubectl get events -n workflow-system

Typical operator events:

NORMAL  WorkflowCluster  prod-east  Reconciling cluster, replicas 3 → 5
NORMAL  WorkflowCluster  prod-east  StatefulSet updated successfully
WARNING WorkflowCluster  prod-east  PostgresUnreachable: connection refused
NORMAL  WorkflowProcess  order-flow  Deployed v3 (key=2251...)
WARNING WorkflowProcess  order-flow  ValidationFailed: missing taskDefinition on Task_charge

RBAC

Cluster-scoped permissions:

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: workflow-operator
rules:
  - apiGroups: [workflow.example.com]
    resources: [workflowclusters, workflowtenants, workflowprocesses, workflowworkers]
    verbs: [get, list, watch, create, update, patch, delete]
  - apiGroups: [workflow.example.com]
    resources: [workflowclusters/status, ...]
    verbs: [update, patch]
  - apiGroups: [apps]
    resources: [statefulsets, deployments]
    verbs: [get, list, watch, create, update, patch, delete]
  - apiGroups: [""]
    resources: [services, configmaps, secrets, events]
    verbs: [get, list, watch, create, update, patch, delete]
  - apiGroups: [networking.k8s.io]
    resources: [ingresses]
    verbs: [get, list, watch, create, update, patch, delete]
  - apiGroups: [autoscaling]
    resources: [horizontalpodautoscalers]
    verbs: [get, list, watch, create, update, patch, delete]

GitOps integration

ArgoCD/Flux watches a Git repo containing the custom resources:

repo/
├── clusters/
│   └── prod-east.yaml  (WorkflowCluster)
├── tenants/
│   ├── acme.yaml       (WorkflowTenant)
│   └── beta.yaml
└── processes/
    ├── order-flow.yaml (WorkflowProcess)
    └── payment-flow.yaml

Pull request → review → merge → ArgoCD apply → operator reconcile → engine deploys.

Multi-cluster (M4)

For multi-region:

apiVersion: workflow.example.com/v1alpha1
kind: WorkflowClusterMesh
metadata:
  name: global
spec:
  clusters:
    - name: us-east-1
      role: primary
      clusterRef: { name: wf-east, namespace: workflow-system, context: east }
    - name: us-west-2
      role: replica
      clusterRef: { name: wf-west, namespace: workflow-system, context: west }
  failoverPolicy:
    type: Manual  # or Automatic with health checks

The operator connects to multiple clusters (via kubeconfig contexts) and orchestrates failover.

Costs and trade-offs

Pros:

  • Declarative day-2 operations (upgrades, scaling, configuration).
  • Natural fit for GitOps.
  • Reuses Kubernetes tooling (kubectl, Lens, Headlamp).
  • Custom metrics plus HPA give autoscaling for the worker fleet.

Cons:

  • Building and maintaining the operator (~3-6 dev-months for v1).
  • More components to monitor.
  • Locked in to Kubernetes (no "bare-metal only" strategy).

Implementation

Stack

  • operator-sdk or kubebuilder (the latter is a Kubernetes SIG project).
  • controller-runtime (low-level).
  • Go (consistent with the engine).
  • Helm chart to install the operator.

Release artifacts

  • Operator container: ghcr.io/example/workflow-operator:vX.Y.Z
  • CRDs: kubectl apply -f https://example.com/operator-crds.yaml
  • Helm: helm install workflow oci://ghcr.io/example/charts/workflow-operator
  • ClusterServiceVersion for OperatorHub (optional).

Comparison with alternatives

| Alternative | Functionality | Notes |
| --- | --- | --- |
| Official Camunda Helm chart | Templates StatefulSets; no day-2 ops | Imperative; not GitOps-friendly |
| Camunda Operator | Incomplete roadmap | Dev preview only |
| CloudNativePG | Postgres-focused; does not manage the engine | We use it as a dependency |
| Strimzi (Kafka) | Excellent reference operator | Patterns to copy |

Roadmap

  • M1: Helm chart with hand-written templates. No CRDs.
  • M2: Operator v0 with the WorkflowCluster CRD. Basic reconciliation.
  • M3: WorkflowProcess CRD (GitOps), WorkflowWorker con HPA.
  • M4: Multi-cluster mesh, automated failover.
