Security Threat Model
STRIDE-based threat analysis for the MVP. It identifies 28 threats specific to the workflow engine, their mitigations, and residual risks. Areas: auth/authz, multi-tenancy isolation, data at rest and in transit, worker security, BPMN parsing attacks, supply chain. Defense in depth: TLS everywhere + RLS + audit logging + secret management. Produces ADR-023 (TLS-only), ADR-024 (RLS defense in depth), ADR-025 (audit logging mandatory).
Why this analysis
Workflow engines store business-critical state: orders, approvals, financial transactions, customer data. A breach is catastrophic:
- Data exfiltration (customer data)
- Tampering (altering approvals)
- DoS (system down = business down)
- Privilege escalation (operator → admin → root)
Camunda does not publish a formal threat model but implements many mitigations. This analysis covers the MVP.
STRIDE framework
STRIDE = Spoofing, Tampering, Repudiation, Information Disclosure, Denial of Service, Elevation of Privilege.
Per-component analysis:
| Component | Spoofing | Tampering | Repudiation | Information Disclosure | DoS | EoP |
|---|---|---|---|---|---|---|
| REST API | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| Engine | - | ✓ | ✓ | ✓ | ✓ | - |
| PostgreSQL | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| Workers | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| Modeler files | - | ✓ | - | ✓ | - | ✓ |
| Webapps UI | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
Identified threats
S - Spoofing (impersonation)
T1.1: Worker impersonation via stolen API key
Threat: An attacker obtains an API key (env var leak, repo commit, log scrape) and executes jobs as a legitimate worker.
Mitigations:
- API keys hashed in DB (bcrypt/argon2), never stored in plaintext
- API keys con expiration (default 90 days)
- API key rotation tooling (mvp-cli api-keys rotate)
- Per-key audit log (which keys used, when, what jobs)
- Restrict API key scopes (read-only vs read-write)
- Network ACL if possible (VPN, source IP restriction)
Residual risk: Medium. An insider with DB access sees only hashes, not plaintext. Active rotation shrinks the exposure window.
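The hashed-storage and expiry mitigations for T1.1 can be sketched as follows. This is a minimal illustration, not the engine's implementation: it uses `hashlib.scrypt` from the standard library as a stand-in for bcrypt/argon2, and `ApiKeyRecord`, `issue_api_key`, `verify_api_key`, and the scrypt cost parameters are assumed names and values.

```python
import hashlib
import hmac
import secrets
from dataclasses import dataclass
from datetime import datetime, timedelta

# Assumed scrypt cost parameters; tune for the deployment hardware.
SCRYPT_PARAMS = {"n": 2**14, "r": 8, "p": 1}
DEFAULT_TTL = timedelta(days=90)  # the 90-day default from the list above


@dataclass
class ApiKeyRecord:
    """What the database stores: salt, hash, and expiry. Never the plaintext key."""
    salt: bytes
    key_hash: bytes
    expires_at: datetime


def _hash_key(plaintext: str, salt: bytes) -> bytes:
    return hashlib.scrypt(plaintext.encode(), salt=salt, **SCRYPT_PARAMS)


def issue_api_key(now: datetime, ttl: timedelta = DEFAULT_TTL) -> tuple[str, ApiKeyRecord]:
    """Return the plaintext key (shown to the caller once) and the record to persist."""
    plaintext = secrets.token_urlsafe(32)
    salt = secrets.token_bytes(16)
    return plaintext, ApiKeyRecord(salt, _hash_key(plaintext, salt), now + ttl)


def verify_api_key(presented: str, record: ApiKeyRecord, now: datetime) -> bool:
    """Reject expired keys, then compare hashes in constant time."""
    if now >= record.expires_at:
        return False
    return hmac.compare_digest(_hash_key(presented, record.salt), record.key_hash)
```

Passing `now` explicitly keeps the expiry check testable; a real service would likely read the clock itself and look the record up by a non-secret key identifier.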
T1.2: User impersonation via stolen JWT
Threat: A JWT with a long expiry is stolen and the attacker acts as the user.
Mitigations:
- Short JWT expiry (15 min recommended, refresh token with a longer expiry)
- Token revocation list (Redis or DB)
- IP/User-Agent binding (detect anomalies)
- Logout invalidates the token server-side
- 2FA via IdP (Auth0/Okta provide it)
Residual risk: Low if JWT expiry is short and the revocation list is active.
T1.3: Webhook spoofing
Threat: An attacker sends a fake webhook to POST /v2/messages/{name}/correlate to inject malicious events.
Mitigations:
- API authentication required (no public webhooks)
- HMAC signing if the webhook comes from an external system: validate the signature
- Correlation key validation (no arbitrary values)
- Rate limit per source
Residual risk: Medium if users expose webhooks publicly without HMAC.
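HMAC validation of an incoming webhook could look roughly like this. It is a sketch under assumptions: the sender is taken to transmit a hex-encoded HMAC-SHA256 of the raw request body in a header, and `sign_payload` and `verify_webhook` are illustrative names, not part of the engine API.

```python
import hashlib
import hmac


def sign_payload(secret: bytes, body: bytes) -> str:
    """Hex HMAC-SHA256 of the raw request body, as the sender would compute it."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def verify_webhook(secret: bytes, body: bytes, signature_header: str) -> bool:
    """Recompute the signature over the raw bytes and compare in constant time."""
    expected = sign_payload(secret, body).encode()
    presented = signature_header.strip().lower().encode()
    return hmac.compare_digest(expected, presented)
```

The signature should be checked against the raw bytes as received, before any JSON parsing, since re-serializing the body can change whitespace or key order and invalidate a legitimate signature.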
T - Tampering (modification)
T2.1: BPMN model tampering during deployment
Threat: An attacker uploads a malicious BPMN model that executes arbitrary code or exfiltrates data via expressions.
Mitigations:
- BPMN XML strict validation (XSD + semantic)
- Expression engine sandbox (CEL is non-Turing-complete, no I/O)
- NO custom function definitions
- Expression evaluation timeout (1 second)
- Permission check: only admin role can deploy
- Audit log of all deployments
- Diff review process (deploy → review → activate)
Residual risk: Low if CEL is strict and deployment is admin-only.
T2.2: Database tampering via SQL injection
Threat: SQL injection in API endpoints leads to arbitrary state mutations.
Mitigations:
- Never build SQL by string concatenation; always use parameterized queries
- Use an ORM or query builder
- Strict input validation
- Limited Postgres role permissions (engine is not a superuser)
- Audit log of DDL operations
- SAST tooling (CodeQL, Semgrep) in CI
Residual risk: Very low with coding discipline.
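The difference between concatenated and parameterized SQL can be demonstrated with a small self-contained example. It uses an in-memory SQLite database purely so it runs anywhere; the engine targets PostgreSQL, where the placeholder syntax differs (`$1` or `%s` depending on the driver). The table layout and function names are assumptions for illustration.

```python
import sqlite3


def find_instances_unsafe(conn, state):
    # BAD: attacker-controlled input becomes part of the SQL text.
    return conn.execute(
        f"SELECT id FROM process_instances WHERE state = '{state}'"
    ).fetchall()


def find_instances_safe(conn, state):
    # GOOD: the value is bound as a parameter and never parsed as SQL.
    return conn.execute(
        "SELECT id FROM process_instances WHERE state = ?", (state,)
    ).fetchall()


def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE process_instances (id INTEGER, state TEXT)")
    conn.executemany(
        "INSERT INTO process_instances VALUES (?, ?)",
        [(1, "ACTIVE"), (2, "CANCELED")],
    )
    payload = "x' OR '1'='1"  # classic injection string
    return find_instances_unsafe(conn, payload), find_instances_safe(conn, payload)
```

With the injection payload, the unsafe variant matches every row because the `OR '1'='1'` clause becomes part of the query, while the safe variant looks for a state literally equal to the payload and finds nothing.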
T2.3: Variable tampering
Threat: A user with write access alters variables of a running instance to bypass logic.
Mitigations:
- Permission check UPDATE_VARIABLES strict
- Audit log of variable updates (who, when, before/after)
- Process modification permission separate
- Immutable variables (BPMN attribute): engine rejects updates
- Optional cryptographic signatures for sensitive variables
Residual risk: Medium. Variable mutability is by design in BPMN. Audit logging and permission checks limit the exposure.
T2.4: In-flight data tampering (MITM)
Threat: A man-in-the-middle alters requests/responses between client and engine.
Mitigations:
- TLS 1.3 mandatory (no plaintext HTTP) → ADR-023
- Optional certificate pinning in SDKs
- mTLS for worker-engine traffic if possible
- HSTS headers
Residual risk: Very low with TLS 1.3 properly configured.
R - Repudiation (denial of actions)
T3.1: User claims "I didn't approve this loan"
Threat: A user completes a user task and later denies it.
Mitigations:
- Comprehensive audit log (mandatory, immutable) recording:
  - User ID (from JWT)
  - Timestamp
  - Action (approve, reject)
  - IP address, User-Agent
  - Variable values
  - JWT claims snapshot
- Audit log signed (HMAC with server secret)
- Append-only audit table (no UPDATEs allowed)
- Retention per compliance (7 years for SOX)
→ ADR-025: audit logging mandatory
Residual risk: Low if the audit log is comprehensive.
T3.2: Worker claims "I didn't complete this job"
Threat: A worker performs a malicious action and later denies having done it.
Mitigations:
- Per-job audit: which API key completed the job
- Worker identification on completion (hostname, pid)
- Source IP of the request
- Activation_count tracking (catch stale completions)
Residual risk: Low.
I - Information Disclosure
T4.1: Cross-tenant data leak via API
Threat: A user of tenant A queries tenant B's data via SQL injection or a bug.
Mitigations:
- Tenant_id in every query (enforced via code review + tests)
- Postgres Row-Level Security as defense in depth → ADR-024
ALTER TABLE process_instances ENABLE ROW LEVEL SECURITY;
CREATE POLICY tenant_isolation ON process_instances
    FOR ALL USING (tenant_id = current_setting('app.current_tenant'));
-- Set app.current_tenant per connection
-- Note: the table owner bypasses RLS unless FORCE ROW LEVEL SECURITY is also set
- Specific tests: try to query another tenant's data and expect rejection
- Tenant_id is never user-controllable (always taken from JWT claims)
Residual risk: Very low with RLS + code discipline.
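One way to set `app.current_tenant` before running tenant queries is a small context manager. This is a sketch that assumes a psycopg-style connection object (`cursor()`, `commit()`, `rollback()`, `%s` placeholders); `tenant_scope` is a hypothetical helper name. It relies on PostgreSQL's `set_config(name, value, is_local)` function with `is_local = true`, which limits the setting to the current transaction.

```python
from contextlib import contextmanager


@contextmanager
def tenant_scope(conn, tenant_id: str):
    """Run one transaction with app.current_tenant set for the RLS policy.

    Assumes a psycopg-style connection: conn.cursor(), cursor.execute(sql, params),
    conn.commit(), conn.rollback().
    """
    cur = conn.cursor()
    try:
        # is_local=true scopes the setting to this transaction, so a pooled
        # connection does not carry a previous tenant into the next request.
        cur.execute("SELECT set_config('app.current_tenant', %s, true)", (tenant_id,))
        yield cur
        conn.commit()
    except Exception:
        conn.rollback()
        raise
```

A handler would then wrap its queries in `with tenant_scope(conn, request.user.tenant_id) as cur:`, with the tenant value coming from the JWT claim as described above.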
T4.2: Secrets in BPMN XML or variables
Threat: A user puts API keys or passwords in process variables, which are then exposed via Operate/Tasklist.
Mitigations:
- Document strongly: all variables are visible
- Sensitive field detection (regex for token, password, apiKey)
- Optional: variable encryption-at-rest with per-tenant key
- BPMN references the secret store instead of inlining secrets
Residual risk: Medium if users do not follow the guidelines.
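T4.3: Sensitive data in logs
Threat: The engine logs request bodies containing sensitive data.
Mitigations:
- Structured logging (JSON); never log the full request body
- Log only IDs and intents, not variable values
- Mask sensitive fields automatically:
import re
SENSITIVE_PATTERNS = [r'password', r'token', r'apiKey', r'ssn', r'creditCard']
def sanitize(obj):
    # Replace values of matching keys with [REDACTED] (top-level keys only)
    hit = lambda key: any(re.search(p, key, re.IGNORECASE) for p in SENSITIVE_PATTERNS)
    return {k: '[REDACTED]' if hit(k) else v for k, v in obj.items()}
Residual risk: Low.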
T4.4: Database backup exfiltration
Threat: Backup files are stolen; they contain all business data.
Mitigations:
- Backups encrypted at rest (pgBackRest --repo-cipher-type)
- S3 server-side encryption (SSE)
- Strict backup access ACL (separate IAM role)
- Backup keys rotated regularly
- Optional database-level encryption (Postgres + TDE via Cipher)
Residual risk: Low if encryption is properly configured.
T4.5: Memory dump exfiltration
Threat: Process memory contains decrypted data and variables.
Mitigations:
- Disable core dumps in production (ulimit -c 0)
- Memory protection (ASLR, DEP)
- Clear sensitive data from memory after use:
# Use bytearray, zero after use
# Note: the original str `password` is immutable and cannot be zeroed this way
password_bytes = bytearray(password.encode())
# ... use ...
for i in range(len(password_bytes)):
    password_bytes[i] = 0
Residual risk: Medium. Defense-in-depth, not preventable absolutely.
D - Denial of Service
T5.1: Resource exhaustion via huge BPMN
Threat: Uploading a 10 MB BPMN file with 100K elements causes a memory explosion.
Mitigations:
- Max BPMN size limit (1 MB default)
- Max elements per process (1000 default)
- Max nesting depth (10 levels)
- Parse timeout (5 seconds)
- Reject before storing
Residual risk: Low.
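The size, element-count, and depth limits can be enforced while streaming the XML, before anything is stored. The sketch below makes several assumptions: it treats "nesting depth" as XML element depth (the engine may instead mean sub-process nesting), it counts elements per document rather than per process, and it does not implement the parse timeout. `check_bpmn_limits` and `BpmnRejected` are illustrative names. The standard-library parser is used for portability; a hardened parser such as defusedxml may be preferable for untrusted input.

```python
import io
import xml.etree.ElementTree as ET

MAX_BYTES = 1_000_000   # 1 MB default
MAX_ELEMENTS = 1000     # default element limit
MAX_DEPTH = 10          # default nesting limit


class BpmnRejected(ValueError):
    """Raised when an upload violates a limit or is not well-formed XML."""


def check_bpmn_limits(xml_bytes: bytes) -> int:
    """Return the element count, or raise BpmnRejected when a limit is exceeded."""
    if len(xml_bytes) > MAX_BYTES:
        raise BpmnRejected("size limit exceeded")
    count = depth = 0
    try:
        # iterparse yields events incrementally, so we can stop at the first violation.
        for event, elem in ET.iterparse(io.BytesIO(xml_bytes), events=("start", "end")):
            if event == "start":
                count += 1
                depth += 1
                if count > MAX_ELEMENTS:
                    raise BpmnRejected("too many elements")
                if depth > MAX_DEPTH:
                    raise BpmnRejected("nesting too deep")
            else:
                depth -= 1
                elem.clear()  # free children of elements already closed
    except ET.ParseError as exc:
        raise BpmnRejected(f"malformed XML: {exc}") from exc
    return count
```

The byte-size check runs first because it is the cheapest and bounds the work the parser can be asked to do.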
T5.2: Resource exhaustion via huge variables
Threat: Process variables > 1 GB cause OOM.
Mitigations:
- Per-variable max size (100 KB default; see Intuit benchmark)
- Per-instance max total variables (10 MB default)
- Strict enforcement at API level
- Reject command, return 413 Payload Too Large
Residual risk: Low.
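A possible shape for the API-level check is shown below. It assumes variable size is measured as the UTF-8 length of the JSON serialization, which is one reasonable definition but not confirmed by this document; `check_variables` and the `existing_total` parameter are hypothetical.

```python
import json

MAX_VARIABLE_BYTES = 100 * 1024          # per-variable default (100 KB)
MAX_INSTANCE_BYTES = 10 * 1024 * 1024    # per-instance default (10 MB)


def check_variables(variables: dict, existing_total: int = 0) -> tuple[int, str]:
    """Return (http_status, message); 413 when either limit is exceeded."""
    total = existing_total  # bytes already stored for this instance
    for name, value in variables.items():
        size = len(json.dumps(value).encode("utf-8"))
        if size > MAX_VARIABLE_BYTES:
            return 413, f"variable '{name}' is {size} bytes (max {MAX_VARIABLE_BYTES})"
        total += size
    if total > MAX_INSTANCE_BYTES:
        return 413, f"instance variables total {total} bytes (max {MAX_INSTANCE_BYTES})"
    return 200, "ok"
```

In practice the web server should also cap the request body size, so an oversized payload is rejected before it is fully read and parsed.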
T5.3: Process instance flood
Threat: An attacker creates 1M process instances per second.
Mitigations:
- Rate limit per tenant (see concepts/backpressure-rest-strategy)
- Quota per tenant (max active instances)
- Cost-based throttling (premium gets more)
- Auto-suspend tenant on suspicious activity
Residual risk: Low.
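Per-tenant rate limiting is commonly implemented as a token bucket. The sketch below is an in-memory, single-process illustration and not the strategy defined in concepts/backpressure-rest-strategy; `TenantRateLimiter` and its parameters are assumed names, and a multi-node deployment would need shared state (for example in Redis or the database).

```python
import time


class TenantRateLimiter:
    """Token bucket per tenant: `rate` tokens refill per second, up to `capacity`."""

    def __init__(self, rate: float, capacity: float, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock          # injectable so tests can control time
        self._buckets = {}          # tenant_id -> (tokens, last_refill_time)

    def allow(self, tenant_id: str) -> bool:
        now = self.clock()
        tokens, last = self._buckets.get(tenant_id, (self.capacity, now))
        # Refill proportionally to elapsed time, never above capacity.
        tokens = min(self.capacity, tokens + (now - last) * self.rate)
        if tokens >= 1.0:
            self._buckets[tenant_id] = (tokens - 1.0, now)
            return True
        self._buckets[tenant_id] = (tokens, now)
        return False
```

A request handler would call `allow(tenant_id)` before creating an instance and answer with HTTP 429 when it returns `False`.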
T5.4: Infinite loop in BPMN
Threat: A process is modeled with an infinite cycle (deliberately or by mistake).
Mitigations:
- Detection: per-instance event count > threshold (10K events) → incident
- Engine timeout for sub-process completion
- Max recursion depth for call activities
- Operator can cancel instance
Residual risk: Low.
T5.5: Job starvation
Threat: One process type dominates the job queue and others starve.
Mitigations:
- Fair scheduling per concepts/job-queue-fairness
- Per-tenant queue partitioning
- Worker concurrency per job type configurable
- Priority levels respected
Residual risk: Low with a fair queue.
E - Elevation of Privilege
T6.1: Operator escalates to admin
Threat: An operator finds a bug that allows admin actions.
Mitigations:
- Permission check on every endpoint (no implicit assumptions)
- Tests specific to each permission level
- Strict RBAC (3 roles, see ADR-013)
- Audit log of admin actions
- Separation of duties (admin can't be both creator and approver of own actions)
Residual risk: Low if tests are comprehensive.
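An explicit permission check per endpoint can be made hard to forget with a decorator. The sketch below assumes a strictly hierarchical role model; ADR-013 defines the actual three roles, so `VIEWER` is a hypothetical placeholder for the third one, and `require_role`, `Forbidden`, and the two handlers are illustrative names.

```python
from enum import IntEnum
from functools import wraps


class Role(IntEnum):
    VIEWER = 1     # hypothetical third role; see ADR-013 for the real model
    OPERATOR = 2
    ADMIN = 3


class Forbidden(Exception):
    pass


def require_role(minimum: Role):
    """Decorator: reject the call unless the caller's role is at least `minimum`."""
    def decorator(handler):
        @wraps(handler)
        def wrapper(user_role: Role, *args, **kwargs):
            if user_role < minimum:
                raise Forbidden(f"{handler.__name__} requires {minimum.name}")
            return handler(user_role, *args, **kwargs)
        wrapper.required_role = minimum  # lets a test assert every endpoint is guarded
        return wrapper
    return decorator


@require_role(Role.ADMIN)
def deploy_process(user_role, bpmn_xml):
    return "deployed"


@require_role(Role.OPERATOR)
def cancel_instance(user_role, key):
    return "canceled"
```

Exposing `required_role` on each handler allows a test to iterate over all registered endpoints and fail if any of them lacks the attribute, which covers the "no implicit assumptions" point.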
T6.2: SQL injection → DB privilege escalation
Threat: Via SQL injection, an attacker runs GRANT or gains superuser access.
Mitigations:
- Engine DB user is not a superuser
- Engine DB user has limited permissions:
  - INSERT, UPDATE, DELETE on engine tables
  - NO CREATE, DROP, ALTER
  - NO GRANT
- Migrations run separately with different credentials
- DB user can't access pg_authid (passwords)
Residual risk: Very low.
T6.3: Container escape via worker
Threat: A worker exploits a container vulnerability to gain host access.
Mitigations:
- Workers run in restricted containers (read-only filesystem, no NET_RAW)
- Seccomp/AppArmor profiles
- Runtime security tools (Falco)
- Image scanning (Trivy, Snyk)
- Pin base images by SHA
- Update regularly
Residual risk: Medium (container security is hard).
T6.4: Supply chain attack
Threat: A malicious dependency (npm/PyPI) introduces a backdoor.
Mitigations:
- Lock files committed (package-lock.json, poetry.lock)
- Dependency scanning (Snyk, Dependabot, Renovate)
- Limit transitive dependencies
- Internal package mirror if possible
- SBOM generation
- Sign releases (Sigstore)
Residual risk: Medium (ongoing concern, hard to eliminate fully).
Multi-tenancy security deep-dive
Critical because a shared engine instance serves multiple tenants.
Defense layers
flowchart TD
L1[Layer 1: API authentication tenant from JWT]
L2[Layer 2: Authorization role check per endpoint]
L3[Layer 3: Request handlers tenant_id from auth, never from body]
L4[Layer 4: SQL queries tenant_id in WHERE clause]
L5[Layer 5: Postgres RLS defense in depth]
L6[Layer 6: Audit logging anomaly detection]
L1 --> L2 --> L3 --> L4 --> L5 --> L6
Common bugs to avoid
# BAD: tenant from request body
async def get_instance(request):
    body = await request.json()
    tenant_id = body['tenant_id']  # ← attacker controls!
    return await db.fetch_one("SELECT * FROM ... WHERE tenant_id = $1", tenant_id)

# GOOD: tenant from JWT (server-controlled)
async def get_instance(request):
    tenant_id = request.user.tenant_id  # from JWT claim
    return await db.fetch_one("SELECT * FROM ... WHERE tenant_id = $1", tenant_id)

# BAD: forget tenant_id in WHERE
async def cancel_instance(pi_key):
    await db.execute("UPDATE process_instances SET state = 'CANCELED' WHERE process_instance_key = $1", pi_key)

# GOOD: always include tenant_id
async def cancel_instance(pi_key, tenant_id):
    result = await db.execute("""
        UPDATE process_instances
        SET state = 'CANCELED'
        WHERE process_instance_key = $1 AND tenant_id = $2
    """, pi_key, tenant_id)
    if result.rowcount == 0:
        return 404  # or 403
Testing tenant isolation
async def test_tenant_isolation():
    """Tenant A cannot access tenant B's data."""
    # Create instances for both tenants
    instance_a = await create_instance(tenant='a', ...)
    instance_b = await create_instance(tenant='b', ...)
    # Try to query B's instance as user A
    user_a_token = create_jwt(tenant='a', role='admin')
    response = await client.get(
        f"/v2/process-instances/{instance_b.key}",
        headers={"Authorization": f"Bearer {user_a_token}"}
    )
    assert response.status_code == 404  # not 403 (avoid info leak)
Maintain a dedicated test suite for cross-tenant violations.
Authentication flow security
sequenceDiagram
participant C as Client
participant IdP as IdP (Auth0/Okta)
participant E as Engine
C->>IdP: Auth request (TLS 1.3, State CSRF, PKCE)
IdP-->>C: Authorization code
C->>IdP: Exchange code + PKCE verifier (short-lived, one-time)
IdP-->>C: access_token (JWT, 15min) + refresh_token
C->>E: API request with access_token
Note over E: Validate JWT signature, expiry, audience,<br/>issuer, revocation list
E-->>C: Process if valid
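The engine-side checks from the diagram (signature, expiry, audience, issuer, revocation list) can be sketched with the standard library alone. This is a simplification under stated assumptions: it uses a shared-secret HS256 signature so that it is self-contained, whereas tokens issued by an IdP such as Auth0 or Okta are typically signed asymmetrically and should be verified with a maintained JWT library against the IdP's published keys. `validate_jwt`, `sign_hs256`, and the use of the `jti` claim for revocation are assumptions for illustration.

```python
import base64
import hashlib
import hmac
import json


def _b64url_decode(segment: str) -> bytes:
    return base64.urlsafe_b64decode(segment + "=" * (-len(segment) % 4))


def _b64url_encode(raw: bytes) -> str:
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode()


def sign_hs256(claims: dict, secret: bytes) -> str:
    """Test helper that mints an HS256 token; a real IdP issues the token."""
    header = _b64url_encode(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = _b64url_encode(json.dumps(claims).encode())
    sig = hmac.new(secret, f"{header}.{payload}".encode(), hashlib.sha256).digest()
    return f"{header}.{payload}.{_b64url_encode(sig)}"


def validate_jwt(token, secret, *, issuer, audience, now, revoked_jtis=frozenset()):
    """Return the claims if every check passes, otherwise raise ValueError."""
    try:
        header_b64, payload_b64, sig_b64 = token.split(".")
        header = json.loads(_b64url_decode(header_b64))
        claims = json.loads(_b64url_decode(payload_b64))
        signature = _b64url_decode(sig_b64)
        if not isinstance(header, dict) or not isinstance(claims, dict):
            raise ValueError("header and payload must be JSON objects")
    except Exception as exc:
        raise ValueError("malformed token") from exc
    if header.get("alg") != "HS256":  # pin the algorithm; never accept "none"
        raise ValueError("unexpected algorithm")
    expected = hmac.new(
        secret, f"{header_b64}.{payload_b64}".encode(), hashlib.sha256
    ).digest()
    if not hmac.compare_digest(expected, signature):
        raise ValueError("bad signature")
    if claims.get("exp", 0) <= now:
        raise ValueError("expired")
    if claims.get("iss") != issuer:
        raise ValueError("wrong issuer")
    aud = claims.get("aud")
    if audience not in (aud if isinstance(aud, list) else [aud]):
        raise ValueError("wrong audience")
    if claims.get("jti") in revoked_jtis:
        raise ValueError("revoked")
    return claims
```

The signature is verified before any claim is trusted, and the revocation lookup comes last because it is the only check that may require a round trip to Redis or the database.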
Encryption at rest
Postgres database
Options ordered by strength:
- TDE (Transparent Data Encryption): encrypted at filesystem level
- pgcrypto extension: encrypt specific columns
- Application-level encryption: encrypt sensitive fields before INSERT
MVP recommendation: filesystem encryption (LUKS, cloud-native) + pgcrypto for ultra-sensitive data (PII).
Backups
Key management:
- Vault (HashiCorp)
- AWS KMS / GCP KMS / Azure Key Vault
- Sealed Secrets (Kubernetes)
Never store keys in env vars or config files.
Audit logging (ADR-025 candidate)
CREATE TABLE audit_log (
    audit_id BIGSERIAL PRIMARY KEY,
    timestamp TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    tenant_id TEXT NOT NULL,
    user_id TEXT,                    -- from JWT sub
    api_key_id UUID,                 -- if API key auth
    source_ip INET NOT NULL,
    user_agent TEXT,
    action TEXT NOT NULL,            -- e.g., 'process_instance.cancel'
    resource_type TEXT NOT NULL,
    resource_id TEXT,
    details JSONB NOT NULL,          -- before/after, request params
    success BOOLEAN NOT NULL,
    error_message TEXT,
    integrity_hmac TEXT NOT NULL     -- HMAC-SHA256 of row content
);
-- Append-only enforcement
REVOKE UPDATE, DELETE ON audit_log FROM PUBLIC; -- also make sure the engine role is never granted UPDATE or DELETE
-- Index for queries
CREATE INDEX idx_audit_tenant_time ON audit_log(tenant_id, timestamp DESC);
CREATE INDEX idx_audit_user ON audit_log(user_id, timestamp DESC);
Loggable actions (mandatory):
- Authentication (success/failure)
- All POST/PATCH/DELETE on resources
- Deployment changes
- User/role changes
- Permission changes
- Configuration changes
NOT logged: read operations (would be massive, not security-relevant in most cases). Exception: read of sensitive resources (incident details, audit log itself).
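Computing and checking the `integrity_hmac` column might look like the following. The canonical form (sorted-key compact JSON, with non-JSON types such as timestamps converted via `str`) is an assumption, as are the function names; whatever form is chosen must be reproduced identically at verification time.

```python
import hashlib
import hmac
import json


def audit_hmac(row: dict, secret: bytes) -> str:
    """HMAC-SHA256 over a canonical JSON encoding of the row, excluding integrity_hmac."""
    content = {k: v for k, v in row.items() if k != "integrity_hmac"}
    canonical = json.dumps(content, sort_keys=True, separators=(",", ":"), default=str)
    return hmac.new(secret, canonical.encode("utf-8"), hashlib.sha256).hexdigest()


def verify_audit_row(row: dict, secret: bytes) -> bool:
    """True when the stored integrity_hmac matches the recomputed value."""
    stored = row.get("integrity_hmac", "")
    return hmac.compare_digest(audit_hmac(row, secret).encode(), stored.encode())
```

A per-row HMAC detects modification of a row but not deletion of whole rows; including the previous row's HMAC in each row's input is one common way to make gaps detectable as well.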
Common attack scenarios
Scenario 1: Compromised worker API key
- Detection: anomalous usage pattern (different IP, different time)
- Response: revoke key, rotate
- Forensics: audit log shows all actions with that key
- Mitigation: rotate other keys if shared compromise suspected
Scenario 2: SQL injection found in code review
- Fix: parameterize query
- Audit log review: any successful exploitation?
- Patch and deploy as an emergency release
- Add CodeQL/Semgrep rule to detect pattern
Scenario 3: Tenant data leak suspected
- Audit log: which user accessed what when
- Check RLS policies still active
- Review code for missing tenant_id checks
- Notify affected tenant per compliance (GDPR 72 hours)
Scenario 4: DDoS attempt
- Auto-scaling triggered (Phase 2+)
- Rate limiter blocks suspicious sources
- Cloudflare/WAF in front
- Monitor: which tenants affected, which IPs
Compliance considerations
Depending on target market:
| Compliance | Requirements covered | Gaps |
|---|---|---|
| SOC2 | Audit log, access control, encryption | Penetration testing needed |
| GDPR | Right to deletion, data minimization | Privacy impact assessment |
| HIPAA | Encryption, audit, access | BAAs needed with providers |
| SOX | Audit log retention, segregation of duties | Specific to public companies |
| PCI-DSS | If processing cards (rare for workflow) | Don't store card data |
MVP Phase 1: target SOC2-readiness as baseline.
Threats explicitly out of scope (Phase 1)
Defer to Phase 2+:
- DDoS protection at scale (use cloud WAF: Cloudflare, AWS Shield)
- Penetration testing (annual, external firm)
- Bug bounty program
- Custom security audit of dependencies
- HSM integration for key management
- FedRAMP/IL5 compliance
Resulting ADRs
This analysis produces 3 new ADRs:
- ADR-023: TLS 1.3 mandatory in production (mitigation T2.4)
- ADR-024: Postgres RLS as defense in depth (mitigation T4.1)
- ADR-025: Audit logging mandatory for security operations (mitigation T3.1, T3.2)
Links
- adrs/adr-013-simple-rbac-three-roles — Authorization model
- adrs/adr-014-oidc-single-idp — Authentication
- concepts/multi-tenancy — Tenant isolation
- concepts/backpressure-rest-strategy — DoS mitigation
- analysis/failure-mode-analysis — Complementary
- STRIDE methodology
- OWASP API Security Top 10