Kubernetes Operator: design and CRDs
Kubernetes operator for deployment, lifecycle management, and scaling of the workflow engine. It covers declarative CRDs (Cluster, Tenant, Process), reconciliation loops, integration with Patroni-managed Postgres, and workload autoscaling.
Operator vs. Helm chart?
| Approach | Pros | Cons | When to use |
|---|---|---|---|
| Helm chart only | Simple, declarative templates | Not reactive; no lifecycle management; manual upgrades | M1 |
| Helm + Operator | Best of both worlds: Helm for install, operator for day-2 | More components to maintain | M2+ |
| Operator-only | Coherent; end-to-end lifecycle management | Steeper learning curve for users | M3+ |
Decision: Helm chart in M1, operator in M2 for day-2 operations.
Proposed CRDs
WorkflowCluster: the engine cluster
apiVersion: workflow.example.com/v1alpha1
kind: WorkflowCluster
metadata:
  name: prod-east
  namespace: workflow-system
spec:
  version: "1.2.3"
  replicas: 3
  resources:
    requests:
      cpu: 2
      memory: 4Gi
    limits:
      cpu: 4
      memory: 8Gi
  postgres:
    # Option A: reference an existing Patroni cluster
    externalRef:
      host: postgres-primary.workflow.svc
      port: 5432
      databaseSecret: wf-db-credentials
    # Option B: let the operator create the Patroni cluster
    # managed:
    #   replicas: 3
    #   storageClassName: fast-ssd
    #   storageSize: 100Gi
  network:
    ingress:
      enabled: true
      host: wf.example.com
      tlsSecret: wf-tls
  observability:
    metrics:
      enabled: true
      serviceMonitor: true # creates a ServiceMonitor for Prometheus
    tracing:
      enabled: true
      otlpEndpoint: otel-collector.observability.svc:4317
  authentication:
    oidc:
      issuerURL: https://auth.example.com
      clientIDSecretRef: oidc-secret
status:
  phase: Running
  conditions:
    - type: Available
      status: "True"
      lastTransitionTime: "2026-05-14T10:00:00Z"
  readyReplicas: 3
  postgresStatus: Healthy
  version: "1.2.3"
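The two Postgres options in the spec are alternatives, so a validating step would likely reject a resource that sets both or neither. The following is a minimal sketch of that check using only the standard library. The Go type names and the specific rules (port range, minimum replicas) are assumptions for illustration; only the field names come from the YAML above.

```go
package main

import (
	"errors"
	"fmt"
)

// ExternalRef and Managed mirror the two options under spec.postgres.
// These Go types are hypothetical; only the field names come from the CRD example.
type ExternalRef struct {
	Host           string
	Port           int
	DatabaseSecret string
}

type Managed struct {
	Replicas         int
	StorageClassName string
	StorageSize      string
}

type PostgresSpec struct {
	ExternalRef *ExternalRef
	Managed     *Managed
}

// Validate enforces that exactly one of externalRef / managed is set and
// applies a few basic sanity checks to the chosen option.
func (p PostgresSpec) Validate() error {
	switch {
	case p.ExternalRef != nil && p.Managed != nil:
		return errors.New("spec.postgres: externalRef and managed are mutually exclusive")
	case p.ExternalRef == nil && p.Managed == nil:
		return errors.New("spec.postgres: one of externalRef or managed is required")
	case p.ExternalRef != nil:
		if p.ExternalRef.Host == "" || p.ExternalRef.Port <= 0 || p.ExternalRef.Port > 65535 {
			return errors.New("spec.postgres.externalRef: host and a valid port are required")
		}
	default:
		if p.Managed.Replicas < 1 {
			return errors.New("spec.postgres.managed: replicas must be at least 1")
		}
	}
	return nil
}

func main() {
	spec := PostgresSpec{ExternalRef: &ExternalRef{
		Host:           "postgres-primary.workflow.svc",
		Port:           5432,
		DatabaseSecret: "wf-db-credentials",
	}}
	fmt.Println("valid:", spec.Validate() == nil)
}
```

In a real operator this kind of rule could also live in the CRD's OpenAPI schema or in an admission webhook; the function form is simply the easiest to unit test.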
WorkflowTenant: tenant configuration
apiVersion: workflow.example.com/v1alpha1
kind: WorkflowTenant
metadata:
  name: acme
spec:
  clusterRef:
    name: prod-east
  displayName: "Acme Corporation"
  quotas:
    maxActiveInstances: 10000
    maxJobsPerMinute: 1000
    maxStorageGB: 50
  isolation:
    rls: true # Postgres Row-Level Security
status:
  phase: Active
  currentInstances: 4231
  currentStorageGB: 12.3
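To illustrate how the quotas in spec relate to the counters in status, here is a small sketch of an admission check for starting a new process instance. The `Quotas`, `Usage`, and `CanStartInstance` names are hypothetical, and treating "at the limit" as "blocked" is an assumption; the document does not say where or how quotas are enforced.

```go
package main

import "fmt"

// Quotas mirrors spec.quotas of the WorkflowTenant example (hypothetical Go type).
type Quotas struct {
	MaxActiveInstances int
	MaxJobsPerMinute   int
	MaxStorageGB       float64
}

// Usage mirrors the counters reported in status (hypothetical Go type).
type Usage struct {
	CurrentInstances int
	CurrentStorageGB float64
}

// CanStartInstance reports whether one more instance fits within the quotas.
// When it does not, the second return value names the quota that blocks it.
func CanStartInstance(q Quotas, u Usage) (bool, string) {
	if u.CurrentInstances >= q.MaxActiveInstances {
		return false, "maxActiveInstances reached"
	}
	if u.CurrentStorageGB >= q.MaxStorageGB {
		return false, "maxStorageGB reached"
	}
	return true, ""
}

func main() {
	q := Quotas{MaxActiveInstances: 10000, MaxJobsPerMinute: 1000, MaxStorageGB: 50}
	ok, reason := CanStartInstance(q, Usage{CurrentInstances: 4231, CurrentStorageGB: 12.3})
	fmt.Println(ok, reason)
}
```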
WorkflowProcess: process definitions via GitOps
apiVersion: workflow.example.com/v1alpha1
kind: WorkflowProcess
metadata:
  name: order-flow
  namespace: acme-tenant
spec:
  clusterRef:
    name: prod-east
  tenantRef:
    name: acme
  source:
    # Option A: inline BPMN
    bpmn: |
      <?xml version="1.0" encoding="UTF-8"?>
      <bpmn:definitions ...>
        ...
      </bpmn:definitions>
    # Option B: ConfigMap reference
    # configMapRef:
    #   name: order-flow-bpmn
    #   key: process.bpmn
    # Option C: Git source
    # gitRef:
    #   url: https://github.com/acme/processes.git
    #   path: processes/order-flow.bpmn
    #   revision: main
  versionTag: "v1.2.3"
  validate: strict # fails if the BPMN is invalid
status:
  phase: Deployed
  processDefinitionKey: 2251799813685250
  version: 3
  deployedAt: "2026-05-14T10:30:00Z"
This fits GitOps naturally: edit the repo, ArgoCD/Flux applies the change, and the operator deploys it to the engine.
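The `validate: strict` setting implies that the operator checks the BPMN before sending it to the engine. What "valid" means is not specified here, so the sketch below only performs a shallow structural check with the standard library: well-formed XML, a `definitions` root, and at least one top-level `process`. It does not validate against the BPMN 2.0 schema, and the `CheckBPMN` name is hypothetical.

```go
package main

import (
	"encoding/xml"
	"errors"
	"fmt"
	"io"
	"strings"
)

// CheckBPMN performs a shallow structural check on a BPMN document.
// It is NOT a schema validation; it only catches gross errors early.
func CheckBPMN(doc string) error {
	dec := xml.NewDecoder(strings.NewReader(doc))
	depth, processes := 0, 0
	for {
		tok, err := dec.Token()
		if err == io.EOF {
			break
		}
		if err != nil {
			return fmt.Errorf("malformed XML: %w", err)
		}
		switch t := tok.(type) {
		case xml.StartElement:
			// Name.Local is the element name without its namespace prefix.
			if depth == 0 && t.Name.Local != "definitions" {
				return fmt.Errorf("root element is <%s>, want <definitions>", t.Name.Local)
			}
			if depth == 1 && t.Name.Local == "process" {
				processes++
			}
			depth++
		case xml.EndElement:
			depth--
		}
	}
	if depth != 0 {
		return errors.New("unbalanced elements")
	}
	if processes == 0 {
		return errors.New("no <process> element found")
	}
	return nil
}

func main() {
	doc := `<?xml version="1.0" encoding="UTF-8"?>
<bpmn:definitions xmlns:bpmn="http://www.omg.org/spec/BPMN/20100524/MODEL">
  <bpmn:process id="order-flow" isExecutable="true"/>
</bpmn:definitions>`
	fmt.Println("error:", CheckBPMN(doc))
}
```

A check like this can run before any network call, so an obviously broken document is rejected without contacting the engine.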
WorkflowWorker (optional): managing workers
apiVersion: workflow.example.com/v1alpha1
kind: WorkflowWorker
metadata:
  name: payment-worker
spec:
  clusterRef:
    name: prod-east
  image: acme/payment-worker:1.2.3
  jobTypes: [charge-payment, refund-payment]
  replicas: 3
  autoscaling:
    minReplicas: 2
    maxReplicas: 20
    targetJobLagThreshold: 100 # custom metric
  resources:
    requests: { cpu: 500m, memory: 256Mi }
    limits: { cpu: 2, memory: 1Gi }
  authentication:
    serviceAccount: worker-sa
The operator creates a Deployment plus an HPA driven by a custom metric (job lag).
Reconciliation loops
WorkflowCluster controller
Watch: WorkflowCluster
Compare desired vs actual:
  - replicas → adjust StatefulSet replica count
  - version → rolling upgrade (canary 1 pod, validate, continue)
  - resources → update pod spec, rollout
  - postgres external → ensure secret exists + reachable
  - ingress → create/update Ingress resource
Update status:
  - phase, readyReplicas, conditions
// Pseudo-Go with controller-runtime. The helper methods (checkPostgresReachable,
// runMigrations, ensure*, updateCondition, updateStatus) are not shown.
import (
	"context"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"

	workflowv1 "example.com/workflow-operator/api/v1alpha1" // hypothetical module path
)

func (r *WorkflowClusterReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	var cluster workflowv1.WorkflowCluster
	if err := r.Get(ctx, req.NamespacedName, &cluster); err != nil {
		// The resource was deleted; nothing left to reconcile.
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}
	// 1. Validate Postgres connectivity
	if err := r.checkPostgresReachable(ctx, &cluster); err != nil {
		return r.updateCondition(ctx, &cluster, "PostgresUnreachable", err)
	}
	// 2. Run migrations if version changed
	if cluster.Status.Version != cluster.Spec.Version {
		if err := r.runMigrations(ctx, &cluster); err != nil {
			// Returning the error requeues with backoff; controller-runtime
			// ignores RequeueAfter when err is non-nil.
			return ctrl.Result{}, err
		}
	}
	// 3. Ensure StatefulSet
	if err := r.ensureStatefulSet(ctx, &cluster); err != nil {
		return ctrl.Result{}, err
	}
	// 4. Ensure Service / Ingress
	if err := r.ensureNetwork(ctx, &cluster); err != nil {
		return ctrl.Result{}, err
	}
	// 5. Ensure ServiceMonitor / OTel config
	if err := r.ensureObservability(ctx, &cluster); err != nil {
		return ctrl.Result{}, err
	}
	// 6. Update status
	return r.updateStatus(ctx, &cluster)
}
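The "compare desired vs actual" step of the cluster controller can be illustrated without any Kubernetes dependency. The sketch below is a simplified, hypothetical model: `ClusterState` and the action names are invented for illustration, and a real controller would compare the CR spec against the live StatefulSet rather than two plain structs.

```go
package main

import "fmt"

// ClusterState is a reduced, hypothetical view of the fields the controller compares.
type ClusterState struct {
	Version  string
	Replicas int
	CPU      string
	Memory   string
}

// PlanActions returns the actions needed to move observed towards desired.
// An empty result means the cluster is already converged, which is what
// makes repeated reconciliations harmless.
func PlanActions(desired, observed ClusterState) []string {
	var actions []string
	if desired.Version != observed.Version {
		actions = append(actions, "rolling-upgrade")
	}
	if desired.CPU != observed.CPU || desired.Memory != observed.Memory {
		actions = append(actions, "update-pod-resources")
	}
	if desired.Replicas != observed.Replicas {
		actions = append(actions, "scale-statefulset")
	}
	return actions
}

func main() {
	desired := ClusterState{Version: "1.2.3", Replicas: 5, CPU: "2", Memory: "4Gi"}
	observed := ClusterState{Version: "1.2.3", Replicas: 3, CPU: "2", Memory: "4Gi"}
	fmt.Println(PlanActions(desired, observed))
}
```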
WorkflowProcess controller
Watch: WorkflowProcess + ConfigMap dependencies
On change:
  1. Render BPMN (inline / ConfigMap / Git fetch)
  2. Validate BPMN against version-tagged schema
  3. POST /api/v1/processes/deploy to engine
  4. If success: update status with key, version
  5. If failure: update status with error + retry
It handles common errors:
- Invalid BPMN → status Failed with a reason.
- Engine unreachable → retry with exponential backoff.
- Tenant quota exceeded → status Blocked.
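The error handling above maps each failure to a status phase and, for transient failures, to a retry delay. Here is a minimal sketch of that mapping. The sentinel errors, the `Pending` phase used while retrying, and the 5 second / 5 minute backoff bounds are all assumptions; only the `Failed` and `Blocked` phases come from the list above.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Hypothetical sentinel errors for the three cases listed in the text.
var (
	ErrInvalidBPMN       = errors.New("invalid BPMN")
	ErrEngineUnreachable = errors.New("engine unreachable")
	ErrQuotaExceeded     = errors.New("tenant quota exceeded")
)

// Outcome is what the controller would write to status and when it would retry.
type Outcome struct {
	Phase      string
	RetryAfter time.Duration // zero means no automatic retry
}

// Backoff returns base doubled once per attempt, capped at max.
func Backoff(attempt int, base, max time.Duration) time.Duration {
	d := base
	for i := 0; i < attempt && d < max; i++ {
		d *= 2
	}
	if d > max {
		d = max
	}
	return d
}

// Classify maps a deploy error to a status phase and a retry delay.
func Classify(err error, attempt int) Outcome {
	switch {
	case err == nil:
		return Outcome{Phase: "Deployed"}
	case errors.Is(err, ErrInvalidBPMN):
		// Retrying cannot fix an invalid document; wait for a spec change.
		return Outcome{Phase: "Failed"}
	case errors.Is(err, ErrQuotaExceeded):
		return Outcome{Phase: "Blocked"}
	default:
		// Engine unreachable and any unclassified error: retry with backoff.
		return Outcome{Phase: "Pending", RetryAfter: Backoff(attempt, 5*time.Second, 5*time.Minute)}
	}
}

func main() {
	for attempt := 0; attempt < 4; attempt++ {
		fmt.Println(attempt, Classify(ErrEngineUnreachable, attempt).RetryAfter)
	}
}
```

With controller-runtime, returning an error from `Reconcile` already triggers rate-limited requeueing, so an explicit delay like this is mainly useful when the controller wants to surface the next retry time in status.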
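WorkflowWorker controller
Reconciliation is similar to Deployment + HPA:
- Creates a Deployment with job_types as an env var
- Creates an HPA on the "job_lag" metric from Prometheus
- Watches for metrics availability
HPA backed by the metrics API server (the metric is of type External, so it is served through the external metrics API):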
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: payment-worker
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: payment-worker
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: External
      external:
        metric:
          name: wf_job_lag_seconds
          selector:
            matchLabels:
              job_type: charge-payment
        target:
          type: AverageValue
          averageValue: "30"
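As a rough guide to how the HPA above would size the worker fleet: for an External metric with an `AverageValue` target, the controller divides the total metric value by the target and rounds up, then clamps to the configured bounds. The sketch below is a simplification of that documented behavior; it ignores the tolerance band, stabilization windows, and scaling policies, and the function name is hypothetical.

```go
package main

import (
	"fmt"
	"math"
)

// DesiredReplicas approximates the HPA result for an External metric with an
// AverageValue target: ceil(totalMetric / targetAverage), clamped to
// [minReplicas, maxReplicas].
func DesiredReplicas(totalMetric, targetAverage float64, minReplicas, maxReplicas int) int {
	if targetAverage <= 0 {
		// A non-positive target is a configuration error; stay at the floor.
		return minReplicas
	}
	n := int(math.Ceil(totalMetric / targetAverage))
	if n < minReplicas {
		return minReplicas
	}
	if n > maxReplicas {
		return maxReplicas
	}
	return n
}

func main() {
	// A total lag of 240 against a per-pod target of 30, with bounds 2..20.
	fmt.Println(DesiredReplicas(240, 30, 2, 20)) // prints 8
}
```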
Upgrade strategy
Engine upgrade
- The reconciler detects that spec.version changed.
- It verifies compatibility (no skipping MAJOR versions).
- It applies DB migrations (idempotent).
- Canary: 1 pod on the new version.
- Health check for 60s; roll back on failure.
- Progressive rolling update: 1 → 2 → 3.
- Update status.version.
func (r *WorkflowClusterReconciler) upgradeStrategy(c *workflowv1.WorkflowCluster) error {
	if isBreaking(c.Status.Version, c.Spec.Version) {
		return errors.New("breaking version change requires manual approval")
	}
	// Canary first; continue with the rolling upgrade once it is healthy.
	if !c.Status.CanaryHealthy {
		return r.canaryUpgrade(c)
	}
	return r.rollingUpgrade(c)
}
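The `isBreaking` helper is not defined in this document. One possible reading of the "no skipping MAJOR versions" rule is sketched below: an upgrade is rejected when the target MAJOR is more than one ahead of the current one. The function names and the parsing rules (plain `MAJOR.MINOR.PATCH` with an optional `v` prefix) are assumptions, and downgrades are deliberately not covered.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// major extracts the MAJOR component of a "MAJOR.MINOR.PATCH" string,
// tolerating an optional leading "v".
func major(version string) (int, error) {
	v := strings.TrimPrefix(version, "v")
	head, _, _ := strings.Cut(v, ".")
	n, err := strconv.Atoi(head)
	if err != nil || n < 0 {
		return 0, fmt.Errorf("invalid version %q", version)
	}
	return n, nil
}

// SkipsMajor reports whether going from current to target jumps over at
// least one MAJOR version (for example 1.x directly to 3.x).
func SkipsMajor(current, target string) (bool, error) {
	c, err := major(current)
	if err != nil {
		return false, err
	}
	t, err := major(target)
	if err != nil {
		return false, err
	}
	return t-c > 1, nil
}

func main() {
	for _, target := range []string{"1.3.0", "2.0.0", "3.0.0"} {
		skips, err := SkipsMajor("1.2.3", target)
		fmt.Println(target, skips, err)
	}
}
```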
An annotation override can skip the canary step.
Pre-upgrade backup (optional)
An annotation can trigger a Postgres backup before the upgrade.
Postgres integration (Patroni)
The operator does NOT reimplement Patroni. Options:
- External: spec.postgres.externalRef points to an existing Patroni cluster.
- Managed: spec.postgres.managed delegates to CloudNativePG or the Crunchy operator (preferred).
With CloudNativePG, the operator waits for the CNPG Cluster to become Ready, reads the connection string from the secret that CNPG creates, and injects it into the engine deployment.
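If the secret exposes individual connection fields rather than a ready-made URI, the operator would need to assemble the connection string itself. The sketch below shows one way to do that safely with `net/url`, so that special characters in the password are escaped. The key names (`host`, `port`, `dbname`, `username`, `password`) are assumptions about the secret layout and should be checked against the secret actually produced in the cluster.

```go
package main

import (
	"fmt"
	"net"
	"net/url"
)

// BuildDSN assembles a postgres:// URL from decoded secret data.
// The key names used here are assumptions, not a documented contract.
func BuildDSN(secret map[string]string) (string, error) {
	for _, key := range []string{"host", "port", "dbname", "username", "password"} {
		if secret[key] == "" {
			return "", fmt.Errorf("secret is missing key %q", key)
		}
	}
	u := url.URL{
		Scheme: "postgres",
		User:   url.UserPassword(secret["username"], secret["password"]),
		Host:   net.JoinHostPort(secret["host"], secret["port"]),
		Path:   "/" + secret["dbname"],
	}
	return u.String(), nil
}

func main() {
	dsn, err := BuildDSN(map[string]string{
		"host":     "pg-rw.workflow-system.svc",
		"port":     "5432",
		"dbname":   "workflow",
		"username": "wf",
		"password": "example-only",
	})
	fmt.Println(dsn, err)
}
```

Injecting the result through a Secret-backed environment variable, rather than writing it into the pod spec directly, keeps the credentials out of the Deployment manifest.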
Observability of the operator
# Operator metrics
operator_reconciliations_total{kind, outcome}
operator_reconciliation_duration_seconds{kind}
operator_resource_status{kind, namespace, name, status}
# Events
kubectl get events -n workflow-system
Typical operator events:
NORMAL WorkflowCluster prod-east Reconciling cluster, replicas 3 → 5
NORMAL WorkflowCluster prod-east StatefulSet updated successfully
WARNING WorkflowCluster prod-east PostgresUnreachable: connection refused
NORMAL WorkflowProcess order-flow Deployed v3 (key=2251...)
WARNING WorkflowProcess order-flow ValidationFailed: missing taskDefinition on Task_charge
RBAC
Cluster-scoped permissions:
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: workflow-operator
rules:
  - apiGroups: [workflow.example.com]
    resources: [workflowclusters, workflowtenants, workflowprocesses, workflowworkers]
    verbs: [get, list, watch, create, update, patch, delete]
  - apiGroups: [workflow.example.com]
    resources: [workflowclusters/status, ...]
    verbs: [update, patch]
  - apiGroups: [apps]
    resources: [statefulsets, deployments]
    verbs: [get, list, watch, create, update, patch, delete]
  - apiGroups: [""]
    resources: [services, configmaps, secrets, events]
    verbs: [get, list, watch, create, update, patch, delete]
  - apiGroups: [networking.k8s.io]
    resources: [ingresses]
    verbs: [get, list, watch, create, update, patch, delete]
  - apiGroups: [autoscaling] # API group name only, without the version
    resources: [horizontalpodautoscalers]
    verbs: [get, list, watch, create, update, patch, delete]
GitOps integration
ArgoCD/Flux watch a Git repo containing the custom resources:
repo/
├── clusters/
│ └── prod-east.yaml (WorkflowCluster)
├── tenants/
│ ├── acme.yaml (WorkflowTenant)
│ └── beta.yaml
└── processes/
├── order-flow.yaml (WorkflowProcess)
└── payment-flow.yaml
Pull request → review → merge → ArgoCD apply → operator reconcile → engine deploys.
Multi-cluster (M4)
For multi-region deployments:
apiVersion: workflow.example.com/v1alpha1
kind: WorkflowClusterMesh
metadata:
  name: global
spec:
  clusters:
    - name: us-east-1
      role: primary
      clusterRef: { name: wf-east, namespace: workflow-system, context: east }
    - name: us-west-2
      role: replica
      clusterRef: { name: wf-west, namespace: workflow-system, context: west }
  failoverPolicy:
    type: Manual # or Automatic with a health check
The operator connects to multiple clusters (via kubeconfig contexts) and orchestrates failover.
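The failover decision itself can be sketched independently of how the operator reaches each cluster. In the hypothetical model below, a replica is promoted only when the primary is unhealthy and the policy is `Automatic`; with `Manual` the function reports that an operator must intervene. The `Member` type, the health flag, and the "first healthy replica wins" rule are assumptions made for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// Member is a hypothetical, reduced view of one entry in spec.clusters
// combined with its observed health.
type Member struct {
	Name    string
	Role    string // "primary" or "replica"
	Healthy bool
}

// FailoverTarget returns the name of the replica to promote, or "" when the
// primary is healthy and no failover is needed.
func FailoverTarget(members []Member, policy string) (string, error) {
	for _, m := range members {
		if m.Role == "primary" && m.Healthy {
			return "", nil
		}
	}
	if policy != "Automatic" {
		return "", errors.New("primary unhealthy: manual failover required")
	}
	// Pick the first healthy replica in declaration order.
	for _, m := range members {
		if m.Role == "replica" && m.Healthy {
			return m.Name, nil
		}
	}
	return "", errors.New("primary unhealthy and no healthy replica available")
}

func main() {
	members := []Member{
		{Name: "us-east-1", Role: "primary", Healthy: false},
		{Name: "us-west-2", Role: "replica", Healthy: true},
	}
	fmt.Println(FailoverTarget(members, "Automatic"))
	fmt.Println(FailoverTarget(members, "Manual"))
}
```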
Costs and trade-offs
Pros:
- Declarative day-2 operations (upgrades, scaling, configuration).
- Natural fit for GitOps.
- Reuse of k8s tooling (kubectl, Lens, Headlamp).
- Custom metrics → HPA = autoscaling for the worker fleet.
Cons:
- Building and maintaining the operator (~3-6 dev-months for v1).
- More components to monitor.
- Locked to K8s (no "bare-metal only" strategy).
Implementation
Stack
- operator-sdk or kubebuilder (official K8s).
- controller-runtime (low-level).
- Go (consistent with the engine).
- Helm chart to install the operator.
Release artifacts
- Operator container: ghcr.io/example/workflow-operator:vX.Y.Z
- CRDs: kubectl apply -f https://example.com/operator-crds.yaml
- Helm: helm install workflow oci://ghcr.io/example/charts/workflow-operator
- ClusterServiceVersion for OperatorHub (optional).
Comparison with alternatives
| Operator | Functionality | Notes |
|---|---|---|
| Official Camunda Helm chart | Templates StatefulSets, no day-2 ops | Imperative, not GitOps-friendly |
| Camunda Operator | Incomplete roadmap | Dev preview only |
| CloudNativePG | Postgres-focused, no engine | We use it as a dependency |
| Strimzi (Kafka) | Excellent reference operator | Patterns to copy |
Roadmap
- M1: Helm chart with hand-written templates. No CRDs.
- M2: Operator v0 with the WorkflowCluster CRD. Basic reconciliation.
- M3: WorkflowProcess CRD (GitOps), WorkflowWorker con HPA.
- M4: Multi-cluster mesh, automated failover.
References
- adrs/adr-003-single-node-mvp-incremental-scaling: phase strategy
- adrs/adr-020-patroni-postgres-ha: Postgres HA
- analysis/schema-migration-strategy: DB migrations during upgrades
- Kubebuilder book
- Operator pattern (k8s docs)
- CloudNativePG: reference Postgres operator