
Security Threat Model

STRIDE-based threat analysis for the MVP. It identifies 23 threats specific to the workflow engine, with their mitigations and residual risks. Areas covered: auth/authz, multi-tenancy isolation, data at rest and in transit, worker security, BPMN parsing attacks, and supply chain. Defense in depth: TLS everywhere + RLS + audit logging + secret management. It produces ADR-023 (TLS-only), ADR-024 (RLS defense in depth), and ADR-025 (audit logging mandatory).

Why this analysis

Workflow engines store business-critical state: orders, approvals, financial transactions, customer data. A breach is catastrophic:

  • Data exfiltration (customer data)
  • Tampering (altering approvals)
  • DoS (system down = business down)
  • Privilege escalation (operator → admin → root)

Camunda does not publish a formal threat model, although it implements many mitigations. This analysis covers the MVP.

STRIDE framework

STRIDE = Spoofing, Tampering, Repudiation, Information Disclosure, Denial of Service, Elevation of Privilege.

Per-component analysis:

Component Spoofing Tampering Repudiation Info Disc DoS EoP
REST API
Engine - -
PostgreSQL
Workers
Modeler files - - -
Webapps UI

Identified threats

S - Spoofing (impersonation)

T1.1: Worker impersonation via stolen API key

Threat: An attacker obtains an API key (env var leak, repo commit, log scrape) and runs jobs as a legitimate worker.

Mitigations:

  • API keys hashed in the DB (bcrypt/argon2); never stored in plaintext
  • API keys expire (default 90 days)
  • API key rotation tooling (mvp-cli api-keys rotate)
  • Per-key audit log (which keys were used, when, for what jobs)
  • Restricted API key scopes (read-only vs read-write)
  • Network ACLs where possible (VPN, source IP restriction)

Residual risk: Medium. An insider with DB access sees only hashes, not plaintext. Active rotation shortens the exposure window.
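
The hashing and expiration mitigations for T1.1 could be sketched as follows. This is a minimal illustration, not the engine's implementation: it uses the standard library's PBKDF2 as a stand-in for bcrypt/argon2 so the example stays self-contained, and the function names, record layout, and iteration count are assumptions.

```python
import hashlib
import hmac
import secrets
from datetime import datetime, timedelta

PBKDF2_ITERATIONS = 200_000   # assumed work factor; tune for your hardware
KEY_TTL = timedelta(days=90)  # default expiration from the mitigation list


def issue_api_key(now: datetime):
    """Return (plaintext key shown once to the caller, record to persist)."""
    plaintext = secrets.token_urlsafe(32)
    salt = secrets.token_bytes(16)
    digest = hashlib.pbkdf2_hmac("sha256", plaintext.encode(), salt, PBKDF2_ITERATIONS)
    # Only the salt, the digest and the expiry are stored, never the plaintext.
    record = {"salt": salt, "digest": digest, "expires_at": now + KEY_TTL}
    return plaintext, record


def verify_api_key(presented: str, record: dict, now: datetime) -> bool:
    """Reject expired keys, then compare digests in constant time."""
    if now >= record["expires_at"]:
        return False
    candidate = hashlib.pbkdf2_hmac(
        "sha256", presented.encode(), record["salt"], PBKDF2_ITERATIONS
    )
    return hmac.compare_digest(candidate, record["digest"])
```

Passing `now` explicitly keeps the expiry check easy to test; a real service would typically supply the current UTC time.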

T1.2: User impersonation via stolen JWT

Threat: A JWT with a long expiry is stolen and the attacker acts as the user.

Mitigations:

  • Short JWT expiry (15 min recommended, refresh token with a longer expiry)
  • Token revocation list (Redis or DB)
  • IP/User-Agent binding (detect anomalies)
  • Logout invalidates the token server-side
  • 2FA via the IdP (Auth0/Okta provide it)

Residual risk: Low if JWT expiry is short and the revocation list is active.
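
The checks implied by T1.2 (signature, expiry, issuer, audience, revocation list) can be illustrated with a small validator. This sketch assumes HS256 with a shared secret so it can run on the standard library alone; tokens from an IdP such as Auth0 or Okta are usually signed with asymmetric keys and would be verified with a JWT library instead. `validate_jwt`, `InvalidToken`, and the `revoked_jti` set are hypothetical names.

```python
import base64
import hashlib
import hmac
import json


class InvalidToken(Exception):
    """Raised when any validation step fails."""


def _b64url_decode(part: str) -> bytes:
    # JWT segments drop base64 padding; restore it before decoding.
    return base64.urlsafe_b64decode(part + "=" * (-len(part) % 4))


def validate_jwt(token, secret, *, issuer, audience, now, revoked_jti):
    """Return the claims if the token passes every check, else raise InvalidToken."""
    try:
        header_b64, payload_b64, sig_b64 = token.split(".")
        header = json.loads(_b64url_decode(header_b64))
        claims = json.loads(_b64url_decode(payload_b64))
        signature = _b64url_decode(sig_b64)
    except ValueError as exc:
        raise InvalidToken("malformed token") from exc
    if not isinstance(header, dict) or not isinstance(claims, dict):
        raise InvalidToken("malformed token")
    # Pin the algorithm instead of trusting whatever the header announces.
    if header.get("alg") != "HS256":
        raise InvalidToken("unexpected algorithm")
    signed = f"{header_b64}.{payload_b64}".encode()
    expected = hmac.new(secret, signed, hashlib.sha256).digest()
    if not hmac.compare_digest(expected, signature):
        raise InvalidToken("bad signature")
    exp = claims.get("exp")
    if not isinstance(exp, (int, float)) or exp <= now:
        raise InvalidToken("expired")
    if claims.get("iss") != issuer:
        raise InvalidToken("wrong issuer")
    aud = claims.get("aud")
    if audience not in (aud if isinstance(aud, list) else [aud]):
        raise InvalidToken("wrong audience")
    if claims.get("jti") in revoked_jti:
        raise InvalidToken("revoked")
    return claims
```

The revocation check runs last because it is the only step that would need a lookup in Redis or the DB; the cheaper local checks reject most bad tokens first.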

T1.3: Webhook spoofing

Threat: An attacker sends a fake webhook to POST /v2/messages/{name}/correlate to inject malicious events.

Mitigations:

  • API authentication required (no public webhooks)
  • HMAC signing when the webhook comes from an external system: validate the signature
  • Correlation key validation (no arbitrary values)
  • Rate limit per source

Residual risk: Medium if a user exposes webhooks publicly without HMAC.
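
HMAC validation for T1.3 might look like the sketch below. It assumes the sender signs the raw request body with HMAC-SHA256 and sends the hex digest in a header; the header format and both function names are assumptions, since external systems differ in how they encode signatures.

```python
import hashlib
import hmac


def sign_payload(secret: bytes, body: bytes) -> str:
    """Hex HMAC-SHA256 of the raw request body."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def is_valid_webhook(secret: bytes, body: bytes, signature_header: str) -> bool:
    """Compare the presented signature with the expected one in constant time."""
    expected = sign_payload(secret, body)
    # Encode both sides so a non-ASCII header cannot raise inside compare_digest.
    return hmac.compare_digest(expected.encode(), signature_header.encode())
```

The signature must be computed over the bytes exactly as received, before any JSON parsing, because re-serializing the body can change whitespace or key order.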

T - Tampering (modification)

T2.1: BPMN model tampering during deployment

Threat: An attacker uploads a malicious BPMN model that executes arbitrary code or exfiltrates data via expressions.

Mitigations:

  • Strict BPMN XML validation (XSD + semantic)
  • Expression engine sandbox (CEL is non-Turing-complete, no I/O)
  • NO custom function definitions
  • Expression evaluation timeout (1 second)
  • Permission check: only the admin role can deploy
  • Audit log of all deployments
  • Diff review process (deploy → review → activate)

Residual risk: Low with strict CEL and admin-only deploy.

T2.2: Database tampering via SQL injection

Threat: SQL injection in API endpoints leads to arbitrary state mutations.

Mitigations:

  • NEVER build SQL by string concatenation; always use parameterized queries
  • Use an ORM or query builder
  • Strict input validation
  • Limited Postgres role permissions (the engine is NOT a superuser)
  • Audit log of DDL operations
  • SAST tooling (CodeQL, Semgrep) in CI

Residual risk: Very low given code discipline.

T2.3: Variable tampering

Threat: A user with write access alters variables of a running instance to bypass logic.

Mitigations:

  • Strict UPDATE_VARIABLES permission check
  • Audit log of variable updates (who, when, before/after)
  • Separate permission for process modification
  • Immutable variables (BPMN attribute): the engine rejects updates
  • Optional cryptographic signatures for sensitive variables

Residual risk: Medium. Variable mutability is by design in BPMN; auditing and permissions limit the exposure.

T2.4: In-flight data tampering (MITM)

Threat: A man-in-the-middle alters requests or responses between client and engine.

Mitigations:

  • TLS 1.3 mandatory (no plaintext HTTP) → ADR-023
  • Optional certificate pinning in SDKs
  • mTLS between worker and engine where possible
  • HSTS headers

Residual risk: Very low con TLS 1.3 properly configured.

R - Repudiation (denial of actions)

T3.1: User claims "I didn't approve this loan"

Threat: User completes user task → later denies it.

Mitigations:

  • Comprehensive audit log (mandatory, immutable) recording:
      • User ID (from JWT)
      • Timestamp
      • Action (approve, reject)
      • IP address, User-Agent
      • Variable values
      • JWT claims snapshot
  • Audit log signed (HMAC with a server secret)
  • Append-only audit table (no UPDATEs allowed)
  • Retention per compliance (7 years for SOX)

→ ADR-025: audit logging mandatory

Residual risk: Low if the audit log is comprehensive.

T3.2: Worker claims "I didn't complete this job"

Threat: A worker performs a malicious action and later denies having done it.

Mitigations:

  • Per-job audit: which API key completed the job
  • Worker identification on completion (hostname, pid)
  • Source IP of the request
  • Activation_count tracking (catches stale completions)

Residual risk: Low.

I - Information Disclosure

T4.1: Cross-tenant data leak via API

Threat: A user of tenant A queries tenant B's data via SQL injection or a bug.

Mitigations:

  • tenant_id in EVERY query (enforced via code review + tests)
  • Postgres Row-Level Security as defense in depth → ADR-024

-- Table owners bypass RLS unless FORCE ROW LEVEL SECURITY is also set
ALTER TABLE process_instances ENABLE ROW LEVEL SECURITY;
CREATE POLICY tenant_isolation ON process_instances
  FOR ALL USING (tenant_id = current_setting('app.current_tenant'));

  • Engine sets app.current_tenant per connection (per transaction when connections are pooled, so the value cannot carry over between requests)
  • Dedicated tests: try to query another tenant → expect rejection
  • tenant_id is NEVER user-controllable (always taken from JWT claims)

Residual risk: Very low with RLS + code discipline.
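
One way the engine could bind app.current_tenant for the RLS policy in T4.1 is sketched below. It assumes an asyncpg-style connection exposing `transaction()` and `execute()`; `tenant_transaction` is a hypothetical helper. It relies on Postgres's `set_config(name, value, is_local)` with `is_local = true`, which scopes the setting to the current transaction.

```python
from contextlib import asynccontextmanager


@asynccontextmanager
async def tenant_transaction(conn, tenant_id: str):
    """Open a transaction and bind app.current_tenant to it.

    `conn` is assumed to expose asyncpg-style `transaction()` and `execute()`.
    """
    async with conn.transaction():
        # is_local=true limits the value to this transaction, so a pooled
        # connection cannot carry one tenant's setting into the next request.
        await conn.execute(
            "SELECT set_config('app.current_tenant', $1, true)", tenant_id
        )
        yield conn
```

Handlers would then run their queries inside `async with tenant_transaction(conn, request.user.tenant_id)`, with the tenant always taken from the JWT claim.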

T4.2: Secrets in BPMN XML or variables

Threat: User puts API keys, passwords in process variables → exposed via Operate/Tasklist.

Mitigations:

  • Document clearly: ALL variables are visible
  • Sensitive field detection (regex for token, password, apiKey)
  • Optional: variable encryption at rest with a per-tenant key
  • BPMN references the secret store, NOT inline values:

<zeebe:header key="apiKey" value="{{secret:slack-token}}" />

  • The reference is resolved by the worker and never stored in the engine

Residual risk: Medium if users do not follow the guidelines.
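
Worker-side resolution of the `{{secret:...}}` placeholder from T4.2 could be sketched like this. The placeholder syntax comes from the header example in the mitigation list; the allowed name characters, the `resolve_secrets` function, and the dict standing in for a real secret store (Vault, KMS) are assumptions.

```python
import re

# Matches {{secret:name}}; the allowed name characters are an assumption.
SECRET_REF = re.compile(r"\{\{secret:([A-Za-z0-9_.-]+)\}\}")


def resolve_secrets(value: str, secret_store: dict) -> str:
    """Replace each placeholder with its secret; raise KeyError for unknown names."""
    return SECRET_REF.sub(lambda match: secret_store[match.group(1)], value)
```

Because resolution happens in the worker process, the engine only ever stores and displays the placeholder text.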

T4.3: Sensitive data in logs

Threat: The engine logs request bodies that contain sensitive data.

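Mitigations:

  • Structured logging (JSON); NEVER log the full request body
  • Log only IDs and intents, not variable values
  • Mask sensitive fields automatically:

import re
SENSITIVE_PATTERNS = [r'password', r'token', r'apiKey', r'ssn', r'creditCard']
_SENSITIVE_RE = re.compile('|'.join(SENSITIVE_PATTERNS), re.IGNORECASE)
def sanitize(obj):
    # Replace values of matching top-level keys with [REDACTED]
    return {k: '[REDACTED]' if _SENSITIVE_RE.search(k) else v for k, v in obj.items()}

  • The audit log is separate and encrypted at rest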

Residual risk: Low.

T4.4: Database backup exfiltration

Threat: Backup files stolen → contains ALL business data.

Mitigations:

  • Backups encrypted at rest (pgBackRest repo1-cipher-type)
  • S3 server-side encryption (SSE)
  • Strict backup access ACL (separate IAM role)
  • Backup keys rotated regularly
  • Optional database-level encryption (Postgres + TDE via Cipher)

Residual risk: Low si encryption properly configured.

T4.5: Memory dump exfiltration

Threat: Process memory contains decrypted data, variables.

Mitigations:

  • Disable core dumps in production (ulimit -c 0)
  • Memory protection (ASLR, DEP)
  • Clear sensitive data from memory after use:

# Use bytearray, zero after use
# Note: the original str and the temporary bytes from encode() are immutable
# and may linger in memory until they are garbage-collected
password_bytes = bytearray(password.encode())
# ... use ...
for i in range(len(password_bytes)):
    password_bytes[i] = 0

  • Run in a confined environment (container, namespaces)

Residual risk: Medium. Defense-in-depth, not preventable absolutely.

D - Denial of Service

T5.1: Resource exhaustion via huge BPMN

Threat: Upload 10MB BPMN with 100K elements → memory explosion.

Mitigations:

  • Max BPMN size limit (1 MB default)
  • Max elements per process (1000 default)
  • Max nesting depth (10 levels)
  • Parse timeout (5 seconds)
  • Reject before storing

Residual risk: Low.
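
The size, element-count, and depth limits for T5.1 could be enforced while streaming the document, as in this sketch. It makes two simplifications that are assumptions rather than the engine's rules: elements are counted per document instead of per process, and XML element depth stands in for BPMN nesting depth. The standard-library parser is also not hardened against entity-expansion attacks, so untrusted input would need a hardened parser in practice.

```python
import io
import xml.etree.ElementTree as ET

MAX_BYTES = 1_000_000  # 1 MB default
MAX_ELEMENTS = 1000    # counted per document in this sketch
MAX_DEPTH = 10         # XML element depth as a proxy for nesting


class BpmnRejected(ValueError):
    """Raised before anything is stored."""


def check_bpmn_limits(xml_bytes: bytes) -> int:
    """Return the element count, or raise BpmnRejected when a limit is exceeded."""
    if len(xml_bytes) > MAX_BYTES:
        raise BpmnRejected("document exceeds size limit")
    depth = 0
    count = 0
    try:
        # iterparse streams events, so limits trip before the whole tree is built.
        for event, _elem in ET.iterparse(io.BytesIO(xml_bytes), events=("start", "end")):
            if event == "start":
                depth += 1
                count += 1
                if depth > MAX_DEPTH:
                    raise BpmnRejected("nesting too deep")
                if count > MAX_ELEMENTS:
                    raise BpmnRejected("too many elements")
            else:
                depth -= 1
    except ET.ParseError as exc:
        raise BpmnRejected(f"malformed XML: {exc}") from exc
    return count
```

The parse timeout from the mitigation list is not covered here; it would need to be enforced by the caller, for example by running the check in a worker with a deadline.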

T5.2: Resource exhaustion via huge variables

Threat: Process variables > 1GB → OOM.

Mitigations:

  • Per-variable max size (100 KB default; see the Intuit benchmark)
  • Per-instance max total variables (10 MB default)
  • Strict enforcement at the API level
  • Reject the command and return 413 Payload Too Large

Residual risk: Low.
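
A check for the T5.2 limits at the API boundary might look like the sketch below. It measures each variable as the UTF-8 length of its JSON serialization, which is an assumption about how size is defined; `check_variables` and `existing_bytes` are hypothetical names.

```python
import json

MAX_VARIABLE_BYTES = 100 * 1024        # per-variable default
MAX_INSTANCE_BYTES = 10 * 1024 * 1024  # per-instance default


def check_variables(variables: dict, existing_bytes: int = 0) -> int:
    """Return 200 if the update fits the limits, else 413."""
    total = existing_bytes
    for value in variables.values():
        size = len(json.dumps(value).encode("utf-8"))
        if size > MAX_VARIABLE_BYTES:
            return 413  # a single variable is too large
        total += size
    if total > MAX_INSTANCE_BYTES:
        return 413  # the instance would exceed its total budget
    return 200
```

`existing_bytes` represents what the instance already stores, so the per-instance limit applies to the total rather than to each request in isolation.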

T5.3: Process instance flood

Threat: Attacker creates 1M process instances per second.

Mitigations:

  • Per-tenant rate limit (see concepts/backpressure-rest-strategy)
  • Quota per tenant (max active instances)
  • Cost-based throttling (premium tenants get more)
  • Auto-suspend a tenant on suspicious activity

Residual risk: Low.
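
Per-tenant rate limiting for T5.3 is commonly built on a token bucket. The sketch below is one such illustration, not the strategy described in concepts/backpressure-rest-strategy; the class names and the explicit `now` parameter (seconds, injected for testability) are assumptions.

```python
class TokenBucket:
    def __init__(self, rate: float, capacity: float) -> None:
        self.rate = rate          # tokens added per second
        self.capacity = capacity  # maximum burst size
        self.tokens = capacity
        self.updated_at = None

    def allow(self, now: float) -> bool:
        """Refill according to elapsed time, then try to spend one token."""
        if self.updated_at is not None:
            elapsed = max(0.0, now - self.updated_at)
            self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated_at = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class TenantRateLimiter:
    """One bucket per tenant, so a flood from one tenant does not throttle others."""

    def __init__(self, rate: float, capacity: float) -> None:
        self._rate = rate
        self._capacity = capacity
        self._buckets = {}

    def allow(self, tenant_id: str, now: float) -> bool:
        if tenant_id not in self._buckets:
            self._buckets[tenant_id] = TokenBucket(self._rate, self._capacity)
        return self._buckets[tenant_id].allow(now)
```

In a handler, `limiter.allow(tenant_id, time.monotonic())` returning False would translate into a 429 response; cost-based throttling could give premium tenants a larger `rate` and `capacity`.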

T5.4: Infinite loop in BPMN

Threat: Process modeled with infinite cycle (deliberate or bug).

Mitigations:

  • Detection: per-instance event count > threshold (10K events) → incident
  • Engine timeout for sub-process completion
  • Max recursion depth for call activities
  • Operator can cancel the instance

Residual risk: Low.
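
The event-count detection for T5.4 reduces to a per-instance counter. A minimal sketch, assuming the engine calls a hook for every event it appends; `LoopGuard` and `record_event` are hypothetical names.

```python
from collections import Counter


class LoopGuard:
    def __init__(self, threshold: int = 10_000) -> None:
        self.threshold = threshold
        self._events = Counter()

    def record_event(self, instance_key: str) -> bool:
        """Count one event; return True once the instance exceeds the threshold."""
        self._events[instance_key] += 1
        return self._events[instance_key] > self.threshold
```

When `record_event` returns True the engine would raise an incident on that instance instead of continuing execution, leaving the operator to inspect or cancel it.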

T5.5: Job starvation

Threat: One process type dominates jobs queue → others starve.

Mitigations:

  • Fair scheduling per concepts/job-queue-fairness
  • Per-tenant queue partitioning
  • Worker concurrency per job type configurable
  • Priority levels respected

Residual risk: Low with a fair queue.
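
Per-tenant queue partitioning for T5.5 can be illustrated with a round-robin queue. This is a generic sketch and not necessarily the algorithm in concepts/job-queue-fairness; `FairJobQueue` is a hypothetical name, and priorities are left out for brevity.

```python
from collections import deque


class FairJobQueue:
    """Round-robin across tenants: each pop serves the next tenant in turn."""

    def __init__(self) -> None:
        # dicts preserve insertion order, so the first key is the next tenant to serve.
        self._queues = {}

    def push(self, tenant_id: str, job) -> None:
        self._queues.setdefault(tenant_id, deque()).append(job)

    def pop(self):
        """Return the next job, or None when every queue is empty."""
        if not self._queues:
            return None
        tenant_id = next(iter(self._queues))
        queue = self._queues.pop(tenant_id)
        job = queue.popleft()
        if queue:
            # Re-insert at the back so other tenants are served first.
            self._queues[tenant_id] = queue
        return job
```

With three jobs queued for tenant "a" and one for tenant "b", the pop order is a, b, a, a: the busy tenant cannot delay the quiet one by more than one job.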

E - Elevation of Privilege

T6.1: Operator escalates to admin

Threat: An operator finds a bug that allows admin actions.

Mitigations:

  • Permission check on EVERY endpoint (no implicit assumptions)
  • Dedicated tests per permission level
  • Strict RBAC (3 roles, see ADR-013)
  • Audit log of admin actions
  • Separation of duties (an admin can't be both creator and approver of their own actions)

Residual risk: Low if tests are comprehensive.
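
An explicit per-endpoint permission check for T6.1 could take the form of a decorator, as sketched below. The role names and the role-to-permission mapping are hypothetical (ADR-013 defines the real roles); only UPDATE_VARIABLES appears elsewhere in this analysis.

```python
import functools

# Hypothetical mapping; the real roles and permissions come from ADR-013.
ROLE_PERMISSIONS = {
    "viewer": {"READ"},
    "operator": {"READ", "UPDATE_VARIABLES", "CANCEL_INSTANCE"},
    "admin": {"READ", "UPDATE_VARIABLES", "CANCEL_INSTANCE", "DEPLOY"},
}


class Forbidden(Exception):
    """Raised when the caller's role lacks the required permission."""


def require_permission(permission: str):
    def decorator(handler):
        @functools.wraps(handler)
        def wrapper(user: dict, *args, **kwargs):
            # Unknown or missing roles get an empty permission set (deny by default).
            granted = ROLE_PERMISSIONS.get(user.get("role"), set())
            if permission not in granted:
                raise Forbidden(permission)
            return handler(user, *args, **kwargs)
        return wrapper
    return decorator


@require_permission("DEPLOY")
def deploy_process(user: dict, bpmn_xml: str) -> str:
    return "deployed"
```

Declaring the permission next to each handler makes a missing check visible in code review, and a test can iterate over all handlers to assert that every one is wrapped.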

T6.2: SQL injection → DB privilege escalation

Threat: Via SQL injection, attacker runs GRANT or accesses superuser.

Mitigations:

  • Engine DB user is NOT a superuser
  • Engine DB user has limited permissions:
      • INSERT, UPDATE, DELETE on engine tables
      • NO CREATE, DROP, ALTER
      • NO GRANT
  • Migrations run separately with different credentials
  • DB user can't access pg_authid (passwords)

Residual risk: Very low.

T6.3: Container escape via worker

Threat: Worker exploits container vulnerability → host access.

Mitigations:

  • Workers run in restricted containers (read-only filesystem, no NET_RAW)
  • Seccomp/AppArmor profiles
  • Runtime security tools (Falco)
  • Image scanning (Trivy, Snyk)
  • Pin base images by SHA
  • Update regularly

Residual risk: Medium (container security is hard).

T6.4: Supply chain attack

Threat: Malicious dependency (npm/PyPI) → backdoor.

Mitigations:

  • Lock files committed (package-lock.json, poetry.lock)
  • Dependency scanning (Snyk, Dependabot, Renovate)
  • Limit transitive dependencies
  • Internal package mirror where possible
  • SBOM generation
  • Sign releases (Sigstore)

Residual risk: Medium (ongoing concern, hard to eliminate fully).

Multi-tenancy security deep-dive

Critical because a shared engine instance serves multiple tenants.

Defense layers

flowchart TD
    L1[Layer 1: API authentication tenant from JWT]
    L2[Layer 2: Authorization role check per endpoint]
    L3[Layer 3: Request handlers tenant_id from auth, never from body]
    L4[Layer 4: SQL queries tenant_id in WHERE clause]
    L5[Layer 5: Postgres RLS defense in depth]
    L6[Layer 6: Audit logging anomaly detection]
    L1 --> L2 --> L3 --> L4 --> L5 --> L6

Common bugs to avoid

# BAD: tenant from request body
async def get_instance(request):
    body = await request.json()
    tenant_id = body['tenant_id']  # ← attacker controls!
    return await db.fetch_one("SELECT * FROM ... WHERE tenant_id = $1", tenant_id)

# GOOD: tenant from JWT (server-controlled)
async def get_instance(request):
    tenant_id = request.user.tenant_id  # from JWT claim
    return await db.fetch_one("SELECT * FROM ... WHERE tenant_id = $1", tenant_id)
# BAD: forget tenant_id in WHERE
async def cancel_instance(pi_key):
    await db.execute("UPDATE process_instances SET state = 'CANCELED' WHERE process_instance_key = $1", pi_key)

# GOOD: always include tenant_id
async def cancel_instance(pi_key, tenant_id):
    result = await db.execute("""
        UPDATE process_instances 
        SET state = 'CANCELED' 
        WHERE process_instance_key = $1 AND tenant_id = $2
    """, pi_key, tenant_id)
    if result.rowcount == 0:
        return 404  # or 403

Testing tenant isolation

async def test_tenant_isolation():
    """Tenant A cannot access tenant B's data."""
    # Create instances for both tenants
    instance_a = await create_instance(tenant='a', ...)
    instance_b = await create_instance(tenant='b', ...)

    # Try to query B's instance as user A
    user_a_token = create_jwt(tenant='a', role='admin')
    response = await client.get(
        f"/v2/process-instances/{instance_b.key}",
        headers={"Authorization": f"Bearer {user_a_token}"}
    )

    assert response.status_code == 404  # not 403 (avoid info leak)

A dedicated test suite covers cross-tenant violations.

Authentication flow security

sequenceDiagram
    participant C as Client
    participant IdP as IdP (Auth0/Okta)
    participant E as Engine

    C->>IdP: Auth request (TLS 1.3, State CSRF, PKCE)
    IdP-->>C: Authorization code
    C->>IdP: Exchange code + PKCE verifier (short-lived, one-time)
    IdP-->>C: access_token (JWT, 15min) + refresh_token
    C->>E: API request with access_token
    Note over E: Validate JWT signature, expiry, audience,<br/>issuer, revocation list
    E-->>C: Process if valid

Encryption at rest

Postgres database

Options ordered by strength:

  1. TDE (Transparent Data Encryption): encrypted at filesystem level
  2. pgcrypto extension: encrypt specific columns
  3. Application-level encryption: encrypt sensitive fields before INSERT

MVP recommendation: filesystem encryption (LUKS, cloud-native) + pgcrypto for ultra-sensitive data (PII).

Backups

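# The passphrase (repo1-cipher-pass) is a secure option that pgBackRest does not
# accept on the command line; "main" is a placeholder stanza name.
pgbackrest --stanza=main \
           --repo1-cipher-type=aes-256-cbc \
           backup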

Key management:

  • Vault (HashiCorp)
  • AWS KMS / GCP KMS / Azure Key Vault
  • Sealed Secrets (Kubernetes)

NEVER in env vars or config files.

Audit logging (ADR-025 candidate)

CREATE TABLE audit_log (
    audit_id BIGSERIAL PRIMARY KEY,
    timestamp TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    tenant_id TEXT NOT NULL,
    user_id TEXT,                    -- from JWT sub
    api_key_id UUID,                 -- if API key auth
    source_ip INET NOT NULL,
    user_agent TEXT,
    action TEXT NOT NULL,            -- e.g., 'process_instance.cancel'
    resource_type TEXT NOT NULL,
    resource_id TEXT,
    details JSONB NOT NULL,          -- before/after, request params
    success BOOLEAN NOT NULL,
    error_message TEXT,
    integrity_hmac TEXT NOT NULL     -- HMAC-SHA256 of row content
);

-- Append-only enforcement (repeat for the engine's own DB role)
REVOKE UPDATE, DELETE ON audit_log FROM PUBLIC;

-- Index for queries
CREATE INDEX idx_audit_tenant_time ON audit_log(tenant_id, timestamp DESC);
CREATE INDEX idx_audit_user ON audit_log(user_id, timestamp DESC);

Loggable actions (mandatory):

  • Authentication (success/failure)
  • All POST/PATCH/DELETE on resources
  • Deployment changes
  • User/role changes
  • Permission changes
  • Configuration changes

NOT logged: read operations (would be massive, not security-relevant in most cases). Exception: read of sensitive resources (incident details, audit log itself).
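
The `integrity_hmac` column could be filled as in the following sketch. The canonicalization (JSON with sorted keys and compact separators) and the choice of which columns are covered are assumptions; both function names are hypothetical. The server secret would come from the key management system.

```python
import hashlib
import hmac
import json


def audit_hmac(server_secret: bytes, row: dict) -> str:
    """HMAC-SHA256 over a canonical JSON form of the row (without integrity_hmac)."""
    # sort_keys gives a stable byte string regardless of dict ordering;
    # default=str covers timestamps, UUIDs and IP address objects.
    canonical = json.dumps(row, sort_keys=True, separators=(",", ":"), default=str)
    return hmac.new(server_secret, canonical.encode("utf-8"), hashlib.sha256).hexdigest()


def verify_audit_row(server_secret: bytes, row: dict, stored_hmac: str) -> bool:
    """Recompute the HMAC and compare it in constant time."""
    expected = audit_hmac(server_secret, row)
    return hmac.compare_digest(expected.encode(), stored_hmac.encode())
```

A per-row HMAC detects modification of a stored row but not deletion of whole rows; chaining each row's HMAC into the next would be one way to cover that case.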

Common attack scenarios

Scenario 1: Compromised worker API key

  1. Detection: anomalous usage pattern (different IP, different time)
  2. Response: revoke key, rotate
  3. Forensics: audit log shows all actions with that key
  4. Mitigation: rotate other keys if shared compromise suspected

Scenario 2: SQL injection found in code review

  1. Fix: parameterize query
  2. Audit log review: any successful exploitation?
  3. Patch + deploy emergency
  4. Add CodeQL/Semgrep rule to detect pattern

Scenario 3: Tenant data leak suspected

  1. Audit log: which user accessed what when
  2. Check RLS policies still active
  3. Review code for missing tenant_id checks
  4. Notify the affected tenant per compliance (under GDPR, the supervisory authority must be notified within 72 hours of becoming aware of a breach)

Scenario 4: DDoS attempt

  1. Auto-scaling triggered (Phase 2+)
  2. Rate limiter blocks suspicious sources
  3. Cloudflare/WAF in front
  4. Monitor: which tenants affected, which IPs

Compliance considerations

Depending on target market:

Compliance | Requirements covered | Gaps
SOC2 | Audit log, access control, encryption | Penetration testing needed
GDPR | Right to deletion, data minimization | Privacy impact assessment
HIPAA | Encryption, audit, access | BAAs needed with providers
SOX | Audit log retention, segregation of duties | Specific to public companies
PCI-DSS | If processing cards (rare for workflow) | Don't store card data

MVP Phase 1: target SOC2-readiness as baseline.

Threats explicitly out of scope (Phase 1)

Defer to Phase 2+:

  • DDoS protection at scale (use cloud WAF: Cloudflare, AWS Shield)
  • Penetration testing (annual, external firm)
  • Bug bounty program
  • Custom security audit of dependencies
  • HSM integration for key management
  • FedRAMP/IL5 compliance

Resulting ADRs

This analysis produces 3 new ADRs:

  • ADR-023: TLS 1.3 mandatory in production (mitigation for T2.4)
  • ADR-024: Postgres RLS as defense in depth (mitigation for T4.1)
  • ADR-025: Audit logging mandatory for security operations (mitigation for T3.1, T3.2)