Security Threat Model

STRIDE-based threat analysis for the MVP. It identifies 23 threats specific to the workflow engine, along with their mitigations and residual risks. Areas covered: auth/authz, multi-tenancy isolation, data at rest and in transit, worker security, BPMN parsing attacks, and supply chain. Defense in depth: TLS everywhere + RLS + audit logging + secret management. Produces ADR-023 (TLS-only), ADR-024 (RLS defense in depth), and ADR-025 (audit logging mandatory).

Why this analysis

Workflow engines store business-critical state: orders, approvals, financial transactions, customer data. A breach is catastrophic:

- Data exfiltration (customer data)
- Tampering (altered approvals)
- DoS (system down = business down)
- Privilege escalation (operator → admin → root)

Camunda does not publish a formal threat model, although it implements many mitigations. This analysis covers the MVP.

STRIDE framework

STRIDE = Spoofing, Tampering, Repudiation, Information Disclosure, Denial of Service, Elevation of Privilege.

Per-component analysis:

Component	Spoofing	Tampering	Repudiation	Information Disclosure	Denial of Service	Elevation of Privilege
REST API	✓	✓	✓	✓	✓	✓
Engine	-	✓	✓	✓	✓	-
PostgreSQL	✓	✓	✓	✓	✓	✓
Workers	✓	✓	✓	✓	✓	✓
Modeler files	-	✓	-	✓	-	✓
Webapps UI	✓	✓	✓	✓	✓	✓

Threats identified

S - Spoofing (impersonation)

T1.1: Worker impersonation via stolen API key

Threat: An attacker obtains an API key (env var leak, repo commit, log scrape) and executes jobs as a legitimate worker.

Mitigations:

- API keys hashed in the DB (bcrypt/argon2); NEVER store plaintext
- API keys with expiration (default 90 days)
- API key rotation tooling (mvp-cli api-keys rotate)
- Per-key audit log (which keys were used, when, for which jobs)
- Restricted API key scopes (read-only vs read-write)
- Network ACL where possible (VPN, source IP restriction)

Residual risk: Medium. An insider with DB access sees only hashes (no plaintext). Active rotation reduces the exposure window.
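The hashing and expiration mitigations could be sketched as follows. This is a minimal illustration, not the MVP's implementation: the function names and record layout are hypothetical, and scrypt from the standard library stands in for bcrypt/argon2 so the example has no third-party dependency.

```python
import hashlib
import hmac
import secrets
from datetime import datetime, timedelta, timezone

KEY_TTL = timedelta(days=90)  # default expiration from the mitigation list


def _hash(plaintext: str, salt: bytes) -> bytes:
    # Assumption: scrypt is a stdlib stand-in for bcrypt/argon2.
    return hashlib.scrypt(plaintext.encode(), salt=salt, n=2**14, r=8, p=1)


def issue_api_key(now: datetime) -> tuple[str, dict]:
    """Return (plaintext key shown once to the caller, record to persist)."""
    plaintext = secrets.token_urlsafe(32)
    salt = secrets.token_bytes(16)
    record = {
        "salt": salt,
        "key_hash": _hash(plaintext, salt),  # only the hash is stored
        "expires_at": now + KEY_TTL,
    }
    return plaintext, record


def verify_api_key(candidate: str, record: dict, now: datetime) -> bool:
    """Reject expired keys, then compare hashes in constant time."""
    if now >= record["expires_at"]:
        return False
    return hmac.compare_digest(_hash(candidate, record["salt"]), record["key_hash"])
```

The plaintext key exists only in the return value of issue_api_key; a DB dump exposes the salt and hash but not a usable credential.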

T1.2: User impersonation via stolen JWT

Threat: A JWT with a long expiry is stolen and the attacker acts as the user.

Mitigations:

- Short JWT expiry (15 min recommended, refresh token with a longer expiry)
- Token revocation list (Redis or DB)
- IP/User-Agent binding (detect anomalies)
- Logout invalidates the token server-side
- 2FA via the IdP (Auth0/Okta provide it)

Residual risk: Low if JWT expiry is short and the revocation list is active.
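To make the expiry and revocation checks concrete, here is a hedged sketch of server-side validation. It assumes HS256 with a shared secret purely so the example is self-contained; tokens issued by an IdP are typically signed asymmetrically and would be verified with a JWT library. The function names, the use of the jti claim as revocation key, and the in-memory set standing in for Redis/DB are all assumptions. Issuer and audience checks are omitted for brevity.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional


def _b64url(raw: bytes) -> str:
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode()


def _unb64url(part: str) -> bytes:
    return base64.urlsafe_b64decode(part + "=" * (-len(part) % 4))


def sign(claims: dict, secret: bytes) -> str:
    """Helper for the example only: in production the IdP issues tokens."""
    header = _b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = _b64url(json.dumps(claims).encode())
    sig = hmac.new(secret, f"{header}.{payload}".encode(), hashlib.sha256).digest()
    return f"{header}.{payload}.{_b64url(sig)}"


def validate(token: str, secret: bytes, revoked_jti: set,
             now: Optional[float] = None) -> dict:
    """Return the claims if the token is authentic, unexpired and not revoked."""
    header_b64, payload_b64, sig_b64 = token.split(".")
    if json.loads(_unb64url(header_b64)).get("alg") != "HS256":
        raise ValueError("unexpected algorithm")
    expected = hmac.new(secret, f"{header_b64}.{payload_b64}".encode(),
                        hashlib.sha256).digest()
    if not hmac.compare_digest(expected, _unb64url(sig_b64)):
        raise ValueError("bad signature")
    claims = json.loads(_unb64url(payload_b64))
    if claims["exp"] <= (time.time() if now is None else now):
        raise ValueError("token expired")
    if claims["jti"] in revoked_jti:  # stand-in for a Redis/DB lookup
        raise ValueError("token revoked")
    return claims
```

Logout would add the token's jti to the revocation set; entries can be dropped once the token's exp has passed, which keeps the list small when expiry is 15 minutes.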

T1.3: Webhook spoofing

Threat: An attacker sends a fake webhook to POST /v2/messages/{name}/correlate to inject malicious events.

Mitigations:

- API authentication required (no public webhooks)
- HMAC signing if the webhook comes from an external system: validate the signature
- Correlation key validation (no arbitrary values)
- Rate limit per source

Residual risk: Medium if users expose webhooks publicly without HMAC.
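A minimal sketch of the HMAC validation step, assuming the external system sends a hex-encoded HMAC-SHA256 of the raw request body in a header. The function names and the hex encoding are assumptions; real providers differ in header name and encoding.

```python
import hashlib
import hmac


def sign_body(secret: bytes, body: bytes) -> str:
    """Hex HMAC-SHA256 of the raw request body."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def is_authentic(secret: bytes, body: bytes, signature_header: str) -> bool:
    """Constant-time comparison of the expected and received signatures."""
    expected = sign_body(secret, body).encode()
    return hmac.compare_digest(expected, signature_header.encode())
```

The signature has to be computed over the raw bytes as received, before any JSON parsing, since re-serializing the body can change its byte representation.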

T - Tampering (modification)

T2.1: BPMN model tampering during deployment

Threat: An attacker uploads a malicious BPMN that executes arbitrary code or exfiltrates data via expressions.

Mitigations:

- Strict BPMN XML validation (XSD + semantic)
- Expression engine sandbox (CEL is non-Turing-complete, no I/O)
- NO custom function definitions
- Expression evaluation timeout (1 second)
- Permission check: only the admin role can deploy
- Audit log of all deployments
- Diff review process (deploy → review → activate)

Residual risk: Low with strict CEL + admin-only deploy.

T2.2: Database tampering via SQL injection

Threat: SQL injection in API endpoints leads to arbitrary state mutations.

Mitigations:

- NEVER build SQL by string concatenation; always use parameterized queries
- Use an ORM or query builder
- Strict input validation
- Limited Postgres role permissions (the engine is NOT a superuser)
- Audit log of DDL operations
- SAST tooling (CodeQL, Semgrep) in CI

Residual risk: Very low given coding discipline.

T2.3: Variable tampering

Threat: A user with write access alters variables of a running instance to bypass logic.

Mitigations:

- Strict UPDATE_VARIABLES permission check
- Audit log of variable updates (who, when, before/after)
- Separate permission for process modification
- Immutable variables (BPMN attribute): the engine rejects updates
- Optional cryptographic signatures for sensitive variables

Residual risk: Medium. Variable mutability is by design in BPMN. Audit + permissions limit the impact.

T2.4: In-flight data tampering (MITM)

Threat: A man-in-the-middle alters requests/responses between client and engine.

Mitigations:

- TLS 1.3 mandatory (no plaintext HTTP) → ADR-023
- Optional certificate pinning in SDKs
- mTLS for worker-engine traffic where possible
- HSTS headers

Residual risk: Very low with TLS 1.3 properly configured.
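As an illustration of the TLS 1.3 floor and the HSTS header, the following sketch uses Python's standard ssl module. The function name and the HSTS max-age value are assumptions, and the certificate paths are left to the caller.

```python
import ssl
from typing import Optional


def make_server_context(certfile: Optional[str] = None,
                        keyfile: Optional[str] = None) -> ssl.SSLContext:
    """Server-side context that refuses anything older than TLS 1.3."""
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    ctx.minimum_version = ssl.TLSVersion.TLSv1_3
    if certfile:
        ctx.load_cert_chain(certfile, keyfile)
    return ctx


# Assumption: one-year max-age; sent on every HTTPS response.
HSTS_HEADER = ("Strict-Transport-Security", "max-age=31536000; includeSubDomains")
```

A client that only supports TLS 1.2 fails the handshake against such a context, which is the behavior ADR-023 asks for.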

R - Repudiation (denial of actions)

T3.1: User claims "I didn't approve this loan"

Threat: A user completes a user task and later denies having done so.

Mitigations:

- Comprehensive audit log (mandatory, immutable) recording:
  - User ID (from JWT)
  - Timestamp
  - Action (approve, reject)
  - IP address, User-Agent
  - Variable values
  - JWT claims snapshot
- Audit log signed (HMAC with a server secret)
- Append-only audit table (no UPDATEs allowed)
- Retention per compliance (7 years for SOX)

→ ADR-025: audit logging mandatory

Residual risk: Low if the audit log is comprehensive.

T3.2: Worker claims "I didn't complete this job"

Threat: A worker performs a malicious action and denies having done so.

Mitigations:

- Per-job audit: which API key completed it
- Worker identification on completion (hostname, pid)
- Source IP of the request
- activation_count tracking (catches stale completions)

Residual risk: Low.

I - Information Disclosure

T4.1: Cross-tenant data leak via API

Threat: A user of tenant A queries tenant B's data via SQL injection or a bug.

Mitigations:

- tenant_id in EVERY query (enforced via code review + tests)
- Postgres Row-Level Security as defense in depth → ADR-024

ALTER TABLE process_instances ENABLE ROW LEVEL SECURITY;
CREATE POLICY tenant_isolation ON process_instances
  FOR ALL USING (tenant_id = current_setting('app.current_tenant'));

- Engine sets app.current_tenant per connection (and must reset it on every checkout when connections are pooled)
- Dedicated tests: try to query another tenant → expect rejection
- tenant_id is NEVER user-controllable (always from JWT claims)

Residual risk: Very low with RLS + code discipline.

T4.2: Secrets in BPMN XML or variables

Threat: A user puts API keys or passwords in process variables, which are then exposed via Operate/Tasklist.

Mitigations:

- Document clearly: ALL variables are visible
- Sensitive field detection (regex for token, password, apiKey)
- Optional: variable encryption-at-rest with a per-tenant key
- BPMN references the secret store, NOT an inline value:

<zeebe:header key="apiKey" value="{{secret:slack-token}}" />

- Resolved by the worker, never stored in the engine

Residual risk: Medium if users do not follow the guidelines.
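Worker-side resolution of a `{{secret:...}}` reference could look like the sketch below. The placeholder syntax comes from the header example above; the function name, the allowed character set for secret names, and the dict standing in for a real secret store client are assumptions.

```python
import re

# Matches {{secret:<name>}}; the allowed name characters are an assumption.
SECRET_REF = re.compile(r"\{\{secret:([A-Za-z0-9_.-]+)\}\}")


def resolve_secrets(value: str, secret_store: dict) -> str:
    """Replace each secret reference with its value; fail on unknown names."""
    def lookup(match: re.Match) -> str:
        name = match.group(1)
        if name not in secret_store:
            raise KeyError(f"unknown secret reference: {name}")
        return secret_store[name]

    return SECRET_REF.sub(lookup, value)
```

The resolved value lives only in the worker's memory for the duration of the job; the engine and its database only ever see the reference string.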

T4.3: Sensitive data in logs

Threat: The engine logs request bodies that contain sensitive data.

Mitigations:

- Structured logging (JSON); NEVER log the full request body
- Log only IDs and intents, not variable values
- Mask sensitive fields automatically:

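import re
SENSITIVE_PATTERNS = [r'password', r'token', r'apiKey', r'ssn', r'creditCard']
def sanitize(obj):
    # Replace the values of matching top-level keys with [REDACTED]
    hit = lambda key: any(re.search(p, key, re.IGNORECASE) for p in SENSITIVE_PATTERNS)
    return {k: '[REDACTED]' if hit(k) else v for k, v in obj.items()}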

- The audit log is separate and encrypted at rest

Residual risk: Low.

T4.4: Database backup exfiltration

Threat: Backup files are stolen; they contain ALL business data.

Mitigations:

- Backups encrypted at rest (pgBackRest repo1-cipher-type)
- S3 server-side encryption (SSE)
- Strict backup access ACL (separate IAM role)
- Backup keys rotated regularly
- Optional database-level encryption (Postgres + TDE via Cipher)

Residual risk: Low if encryption is properly configured.

T4.5: Memory dump exfiltration

Threat: Process memory contains decrypted data and variables.

Mitigations:

- Disable core dumps in production (ulimit -c 0)
- Memory protection (ASLR, DEP)
- Clear sensitive data from memory after use:

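# Use a bytearray and zero it after use. Caveat: the source `password` str is
# immutable and may linger in memory, so prefer reading secrets as bytes.
password_bytes = bytearray(password.encode())
# ... use ...
for i in range(len(password_bytes)):
    password_bytes[i] = 0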

- Run in a confined environment (container, namespaces)

Residual risk: Medium. These are defense-in-depth measures; the threat cannot be prevented absolutely.

D - Denial of Service

T5.1: Resource exhaustion via huge BPMN

Threat: Uploading a 10 MB BPMN with 100K elements causes a memory explosion.

Mitigations:

- Max BPMN size limit (1 MB default)
- Max elements per process (1000 default)
- Max nesting depth (10 levels)
- Parse timeout (5 seconds)
- Reject before storing

Residual risk: Low.
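The size, element-count and depth limits can be enforced while streaming the XML, before anything is stored. The sketch below is an assumption-laden illustration: it counts raw XML elements and XML nesting rather than BPMN flow elements and subprocess levels, the names are hypothetical, and the parse timeout is not implemented. A hardened parser (for example the defusedxml package) would be worth considering in place of the standard library one.

```python
import io
import xml.etree.ElementTree as ET

MAX_BYTES = 1_000_000   # 1 MB default
MAX_ELEMENTS = 1000     # counted as XML elements (assumption)
MAX_DEPTH = 10          # counted as XML nesting (assumption)


class BpmnRejected(ValueError):
    """Raised when an upload violates a limit or is not well-formed."""


def check_bpmn_limits(xml_bytes: bytes) -> int:
    """Return the element count, or raise BpmnRejected."""
    if len(xml_bytes) > MAX_BYTES:
        raise BpmnRejected("file too large")
    depth = count = 0
    try:
        # iterparse streams events, so we can abort before building a full tree.
        for event, _elem in ET.iterparse(io.BytesIO(xml_bytes),
                                         events=("start", "end")):
            if event == "start":
                depth += 1
                count += 1
                if depth > MAX_DEPTH:
                    raise BpmnRejected("nesting too deep")
                if count > MAX_ELEMENTS:
                    raise BpmnRejected("too many elements")
            else:
                depth -= 1
    except ET.ParseError as exc:
        raise BpmnRejected(f"malformed XML: {exc}") from exc
    return count
```

Checking the byte length first means oversized uploads are rejected without invoking the parser at all.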

T5.2: Resource exhaustion via huge variables

Threat: Process variables larger than 1 GB cause an OOM.

Mitigations:

- Per-variable max size (100 KB default; see Intuit benchmark)
- Per-instance max total variable size (10 MB default)
- Strict enforcement at the API level
- Reject the command and return 413 Payload Too Large

Residual risk: Low.
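A sketch of the API-level check, assuming variable size is measured as the UTF-8 length of the JSON-serialized value. The function name and the (status, message) return shape are hypothetical.

```python
import json

MAX_VARIABLE_BYTES = 100 * 1024         # 100 KB per variable
MAX_INSTANCE_BYTES = 10 * 1024 * 1024   # 10 MB per instance


def check_variables(variables: dict, existing_total: int = 0) -> tuple[int, str]:
    """Return (HTTP status, message) for a set-variables command."""
    total = existing_total
    for name, value in variables.items():
        size = len(json.dumps(value).encode("utf-8"))
        if size > MAX_VARIABLE_BYTES:
            return 413, f"variable '{name}' exceeds {MAX_VARIABLE_BYTES} bytes"
        total += size
    if total > MAX_INSTANCE_BYTES:
        return 413, "instance variables exceed total limit"
    return 200, "ok"
```

The existing_total argument lets the caller account for variables already stored on the instance, so many small updates cannot add up past the per-instance limit.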

T5.3: Process instance flood

Threat: An attacker creates 1M process instances per second.

Mitigations:

- Per-tenant rate limit (see backpressure rest strategy)
- Quota per tenant (max active instances)
- Cost-based throttling (premium tenants get more)
- Auto-suspend a tenant on suspicious activity

Residual risk: Low.
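One common way to implement a per-tenant rate limit is a token bucket per tenant. The sketch below is a single-process, in-memory illustration under that assumption; the class name and parameters are hypothetical, and a multi-node deployment would need shared state.

```python
import time


class TenantRateLimiter:
    """Token bucket per tenant: `rate_per_sec` refill, `burst` capacity."""

    def __init__(self, rate_per_sec: float, burst: int, clock=time.monotonic):
        self.rate = rate_per_sec
        self.burst = burst
        self.clock = clock      # injectable for tests
        self._buckets = {}      # tenant_id -> (tokens, last_refill_time)

    def allow(self, tenant_id: str) -> bool:
        now = self.clock()
        tokens, last = self._buckets.get(tenant_id, (float(self.burst), now))
        # Refill proportionally to elapsed time, capped at the burst size.
        tokens = min(self.burst, tokens + (now - last) * self.rate)
        allowed = tokens >= 1.0
        if allowed:
            tokens -= 1.0
        self._buckets[tenant_id] = (tokens, now)
        return allowed
```

Because each tenant has its own bucket, a flood from one tenant exhausts only that tenant's tokens and leaves the others unaffected.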

T5.4: Infinite loop in BPMN

Threat: A process is modeled with an infinite cycle (deliberately or by mistake).

Mitigations:

- Detection: per-instance event count > threshold (10K events) → incident
- Engine timeout for sub-process completion
- Max recursion depth for call activities
- Operator can cancel the instance

Residual risk: Low.

T5.5: Job starvation

Threat: One process type dominates the job queue and the others starve.

Mitigations:

- Fair scheduling, as per job queue fairness
- Per-tenant queue partitioning
- Configurable worker concurrency per job type
- Priority levels respected

Residual risk: Low with a fair queue.

E - Elevation of Privilege

T6.1: Operator escalates to admin

Threat: An operator finds a bug that allows admin actions.

Mitigations:

- Permission check on EVERY endpoint (no implicit assumptions)
- Dedicated tests per permission level
- Strict RBAC (3 roles, see ADR-013)
- Audit log of admin actions
- Separation of duties (an admin can't be both creator and approver of their own actions)

Residual risk: Low if the tests are comprehensive.
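An explicit check on every endpoint can be expressed as a decorator, so a handler without one stands out in review. The role names, permission strings and handler below are hypothetical; ADR-013 defines the actual three roles.

```python
from functools import wraps

# Hypothetical roles and permissions, for illustration only.
ROLE_PERMISSIONS = {
    "viewer": {"instance.read"},
    "operator": {"instance.read", "instance.cancel", "variables.update"},
    "admin": {"instance.read", "instance.cancel", "variables.update",
              "process.deploy"},
}


class Forbidden(Exception):
    """Raised when the caller's role lacks the required permission."""


def requires(permission: str):
    """Decorator: deny by default unless the role grants `permission`."""
    def decorator(handler):
        @wraps(handler)
        def wrapper(user: dict, *args, **kwargs):
            granted = ROLE_PERMISSIONS.get(user.get("role"), set())
            if permission not in granted:
                raise Forbidden(permission)
            return handler(user, *args, **kwargs)
        return wrapper
    return decorator


@requires("process.deploy")
def deploy_process(user: dict, bpmn_xml: str) -> str:
    return "deployed"
```

An unknown or missing role maps to the empty permission set, so the check fails closed.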

T6.2: SQL injection → DB privilege escalation

Threat: Via SQL injection, an attacker runs GRANT or gains superuser access.

Mitigations:

- Engine DB user is NOT a superuser
- Engine DB user has limited permissions:
  - INSERT, UPDATE, DELETE on engine tables
  - NO CREATE, DROP, ALTER
  - NO GRANT
- Migrations run separately with different credentials
- DB user can't access pg_authid (passwords)

Residual risk: Very low.

T6.3: Container escape via worker

Threat: A worker exploits a container vulnerability to gain host access.

Mitigations:

- Workers run in restricted containers (read-only filesystem, no NET_RAW)
- Seccomp/AppArmor profiles
- Runtime security tools (Falco)
- Image scanning (Trivy, Snyk)
- Pin base images by SHA
- Update regularly

Residual risk: Medium (container security is hard).

T6.4: Supply chain attack

Threat: A malicious dependency (npm/PyPI) introduces a backdoor.

Mitigations:

- Lock files committed (package-lock.json, poetry.lock)
- Dependency scanning (Snyk, Dependabot, Renovate)
- Limit transitive dependencies
- Internal package mirror where possible
- SBOM generation
- Sign releases (Sigstore)

Residual risk: Medium (ongoing concern, hard to eliminate fully).

Multi-tenancy security deep-dive

Critical because a shared engine instance serves multiple tenants.

Defense layers

flowchart TD
    L1[Layer 1: API authentication tenant from JWT]
    L2[Layer 2: Authorization role check per endpoint]
    L3[Layer 3: Request handlers tenant_id from auth, never from body]
    L4[Layer 4: SQL queries tenant_id in WHERE clause]
    L5[Layer 5: Postgres RLS defense in depth]
    L6[Layer 6: Audit logging anomaly detection]
    L1 --> L2 --> L3 --> L4 --> L5 --> L6

Common bugs to avoid

# BAD: tenant from request body
async def get_instance(request):
    body = await request.json()
    tenant_id = body['tenant_id']  # ← attacker controls!
    return await db.fetch_one("SELECT * FROM ... WHERE tenant_id = $1", tenant_id)

# GOOD: tenant from JWT (server-controlled)
async def get_instance(request):
    tenant_id = request.user.tenant_id  # from JWT claim
    return await db.fetch_one("SELECT * FROM ... WHERE tenant_id = $1", tenant_id)

# BAD: forget tenant_id in WHERE
async def cancel_instance(pi_key):
    await db.execute("UPDATE process_instances SET state = 'CANCELED' WHERE process_instance_key = $1", pi_key)

# GOOD: always include tenant_id
async def cancel_instance(pi_key, tenant_id):
    result = await db.execute("""
        UPDATE process_instances 
        SET state = 'CANCELED' 
        WHERE process_instance_key = $1 AND tenant_id = $2
    """, pi_key, tenant_id)
    if result.rowcount == 0:
        return 404  # or 403

Testing tenant isolation

async def test_tenant_isolation():
    """Tenant A cannot access tenant B's data."""
    # Create instances for both tenants
    instance_a = await create_instance(tenant='a', ...)
    instance_b = await create_instance(tenant='b', ...)

    # Try to query B's instance as user A
    user_a_token = create_jwt(tenant='a', role='admin')
    response = await client.get(
        f"/v2/process-instances/{instance_b.key}",
        headers={"Authorization": f"Bearer {user_a_token}"}
    )

    assert response.status_code == 404  # not 403 (avoid info leak)

Maintain a dedicated test suite for cross-tenant violations.

Authentication flow security

sequenceDiagram
    participant C as Client
    participant IdP as IdP (Auth0/Okta)
    participant E as Engine

    C->>IdP: Auth request (TLS 1.3, State CSRF, PKCE)
    IdP-->>C: Authorization code
    C->>IdP: Exchange code + PKCE verifier (short-lived, one-time)
    IdP-->>C: access_token (JWT, 15min) + refresh_token
    C->>E: API request with access_token
    Note over E: Validate JWT signature, expiry, audience,<br/>issuer, revocation list
    E-->>C: Process if valid
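The PKCE part of the flow can be illustrated with the S256 method from RFC 7636: the client keeps a random verifier and sends only its hashed challenge in the initial auth request. The function names below are hypothetical.

```python
import base64
import hashlib
import secrets


def challenge_for(verifier: str) -> str:
    """S256 code challenge: base64url(SHA-256(verifier)) without padding."""
    digest = hashlib.sha256(verifier.encode("ascii")).digest()
    return base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")


def make_pkce_pair() -> tuple[str, str]:
    """Return (code_verifier, code_challenge) for one authorization request."""
    # 64 random bytes encode to 86 URL-safe characters, within the
    # 43 to 128 character range that RFC 7636 allows.
    verifier = secrets.token_urlsafe(64)
    return verifier, challenge_for(verifier)
```

The IdP recomputes the challenge from the verifier at the code-exchange step, so an intercepted authorization code is useless without the verifier that never left the client.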

Encryption at rest

Postgres database

Options ordered by strength:

1. TDE (Transparent Data Encryption): encrypted at filesystem level
2. pgcrypto extension: encrypt specific columns
3. Application-level encryption: encrypt sensitive fields before INSERT

MVP recommendation: filesystem encryption (LUKS, cloud-native) + pgcrypto for ultra-sensitive data (PII).

Backups

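# pgBackRest does not accept the repo passphrase on the command line;
# inject it at runtime from the key manager (stanza name is illustrative)
PGBACKREST_REPO1_CIPHER_PASS="$CIPHER_KEY" \
pgbackrest --stanza=main \
           --repo1-cipher-type=aes-256-cbc \
           backup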

Key management:

- Vault (HashiCorp)
- AWS KMS / GCP KMS / Azure Key Vault
- Sealed Secrets (Kubernetes)

NEVER in env vars or config files.

Audit logging (ADR-025 candidate)

CREATE TABLE audit_log (
    audit_id BIGSERIAL PRIMARY KEY,
    timestamp TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    tenant_id TEXT NOT NULL,
    user_id TEXT,                    -- from JWT sub
    api_key_id UUID,                 -- if API key auth
    source_ip INET NOT NULL,
    user_agent TEXT,
    action TEXT NOT NULL,            -- e.g., 'process_instance.cancel'
    resource_type TEXT NOT NULL,
    resource_id TEXT,
    details JSONB NOT NULL,          -- before/after, request params
    success BOOLEAN NOT NULL,
    error_message TEXT,
    integrity_hmac TEXT NOT NULL     -- HMAC-SHA256 of row content
);

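-- Append-only enforcement (FROM ALL is not valid PostgreSQL; use PUBLIC,
-- and also revoke from any role that was granted these privileges explicitly)
REVOKE UPDATE, DELETE ON audit_log FROM PUBLIC;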

-- Index for queries
CREATE INDEX idx_audit_tenant_time ON audit_log(tenant_id, timestamp DESC);
CREATE INDEX idx_audit_user ON audit_log(user_id, timestamp DESC);

Loggable actions (mandatory):

- Authentication (success/failure)
- All POST/PATCH/DELETE on resources
- Deployment changes
- User/role changes
- Permission changes
- Configuration changes

NOT logged: read operations (would be massive, not security-relevant in most cases). Exception: reads of sensitive resources (incident details, the audit log itself).
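The integrity_hmac column could be computed as in the sketch below, assuming the HMAC covers a canonical JSON rendering of every other column. The canonicalization scheme and function names are assumptions; the server secret would come from the key manager.

```python
import hashlib
import hmac
import json


def row_hmac(row: dict, server_secret: bytes) -> str:
    """HMAC-SHA256 over a canonical JSON form of the row (sorted keys)."""
    canonical = json.dumps(row, sort_keys=True, separators=(",", ":"),
                           default=str)  # default=str covers timestamps, UUIDs
    return hmac.new(server_secret, canonical.encode("utf-8"),
                    hashlib.sha256).hexdigest()


def verify_row(row: dict, stored_hmac: str, server_secret: bytes) -> bool:
    """True if the row content still matches the stored HMAC."""
    expected = row_hmac(row, server_secret).encode()
    return hmac.compare_digest(expected, stored_hmac.encode())
```

Sorting keys makes the digest independent of column order, so a row read back from the database verifies against the value computed at insert time. This detects modified rows but not deleted ones; detecting deletion would need something extra, such as chaining each row's HMAC to the previous one.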

Common attack scenarios

Scenario 1: Compromised worker API key

Detection: anomalous usage pattern (different IP, different time)
Response: revoke key, rotate
Forensics: audit log shows all actions with that key
Mitigation: rotate other keys if shared compromise suspected

Scenario 2: SQL injection found in code review

Fix: parameterize query
Audit log review: any successful exploitation?
Emergency patch + deploy
Add CodeQL/Semgrep rule to detect pattern

Scenario 3: Tenant data leak suspected

Audit log: which user accessed what when
Check RLS policies still active
Review code for missing tenant_id checks
Notify affected tenant per compliance (GDPR 72 hours)

Scenario 4: DDoS attempt

Auto-scaling triggered (Phase 2+)
Rate limiter blocks suspicious sources
Cloudflare/WAF in front
Monitor: which tenants affected, which IPs

Compliance considerations

Depending on target market:

Compliance	Requirements covered	Gaps
SOC2	Audit log, access control, encryption	Penetration testing needed
GDPR	Right to deletion, data minimization	Privacy impact assessment
HIPAA	Encryption, audit, access	BAAs needed with providers
SOX	Audit log retention, segregation of duties	Specific to public companies
PCI-DSS	If processing cards (rare for workflow)	Don't store card data

MVP Phase 1: target SOC2-readiness as baseline.

Threats explicitly out of scope (Phase 1)

Defer to Phase 2+:

DDoS protection at scale (use cloud WAF: Cloudflare, AWS Shield)
Penetration testing (annual, external firm)
Bug bounty program
Custom security audit of dependencies
HSM integration for key management
FedRAMP/IL5 compliance

Resulting ADRs

This analysis produces 3 new ADRs:

ADR-023: TLS 1.3 mandatory in production (mitigation T2.4)
ADR-024: Postgres RLS as defense in depth (mitigation T4.1)
ADR-025: Audit logging mandatory for security operations (mitigation T3.1, T3.2)

Links

adr 013 simple rbac three roles — Authorization model
adr 014 oidc single idp — Authentication
multi tenancy — Tenant isolation
backpressure rest strategy — DoS mitigation
failure mode analysis — Complementary
STRIDE methodology
OWASP API Security Top 10