Failure Mode Analysis

Systematic analysis of the MVP's failure modes, component by component. For each failure: detection mechanism, blast radius, RTO/RPO objectives, mitigation strategy, and recovery procedure. Covers engine crashes, DB failures, network partitions, worker failures, BPMN edge cases, data corruption, and clock issues. The resulting runbook entries are reusable as a baseline for SRE on-call.

FMEA methodology

For each component:

1. Identify failure modes (what can fail)
2. Effects (what the impact is)
3. Detection (how it is discovered)
4. Severity (1-5 scale)
5. Likelihood (1-5 scale)
6. Mitigation (how to prevent or reduce it)
7. Recovery (what to do when it occurs)

Severity scale

Sev	Description	Example
5	Catastrophic data loss	DB corruption, no backup
4	Major service disruption	Engine down > 1 hour
3	Degraded service	High latency, partial outage
2	Minor user-facing issue	Single feature unavailable
1	Cosmetic	Log noise

Likelihood scale

Lik	Description	Frequency
5	Almost certain	Daily
4	Likely	Weekly
3	Possible	Monthly
2	Unlikely	Yearly
1	Rare	Multi-year

Risk Priority Number (RPN) = Severity × Likelihood. Address high-RPN first.
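
As a minimal sketch of the scoring rule above, the snippet below computes the RPN as Severity × Likelihood and sorts failure modes highest first. The `FailureMode` class and `prioritize` function are illustrative names, not part of the MVP; breaking ties by failure number is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureMode:
    fid: str         # e.g. "F1"
    name: str
    severity: int    # 1-5, per the severity scale
    likelihood: int  # 1-5, per the likelihood scale

    @property
    def rpn(self) -> int:
        # RPN as defined in this document: Severity x Likelihood
        return self.severity * self.likelihood


def prioritize(modes):
    # Highest RPN first; ties broken by failure number (an assumption).
    return sorted(modes, key=lambda m: (-m.rpn, int(m.fid[1:])))


modes = [
    FailureMode("F2", "Engine stuck", 4, 2),
    FailureMode("F16", "Malformed BPMN", 1, 3),
    FailureMode("F7", "Disk full", 5, 2),
    FailureMode("F1", "Engine crash", 4, 3),
]

for m in prioritize(modes):
    print(m.fid, m.name, m.rpn)
```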

Engine failures

F1: Engine process crash (OOM, panic, segfault)

Aspect	Detail
Detection	Health endpoint 503, process exit
Effects	All in-flight requests fail. New requests fail until restart.
Severity	4
Likelihood	3
RPN	12
Mitigation	Process supervisor (systemd, Docker restart policy, Kubernetes) auto-restart. Memory limits enforced.
Recovery	Auto-restart in 5-30s. Recovery scans command_log to resume processing (per api engine serialization). Pending timers are reloaded (per timer recovery postgres).
RTO	30s
RPO	0 (commands in the log persist)
Runbook	1. Check logs for crash cause. 2. Verify auto-restart succeeded. 3. If repeated: investigate OOM/memory leak. 4. Alert if down > 5 min.

F2: Engine processing loop stuck

Aspect	Detail
Detection	last_processed_position no avanza. Queue depth growing.
Effects	New commands queue up, no processing. Eventually backpressure trips.
Severity	4
Likelihood	2
RPN	8
Mitigation	Watchdog timer: if no progress for 60s, log warning and restart. Bounded operations (timeouts on DB calls, etc.).
Recovery	Restart engine. Use traces to identify which processor got stuck.
RTO	5 min
RPO	0
Runbook	1. Check `engine.processing.position` metric. 2. If stuck > 60s, restart. 3. Pull stack trace before restart for diagnosis.
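
The watchdog mitigation for F2 can be sketched as follows. This is an illustration under assumptions: `ProgressWatchdog`, its `observe` method, and the `queue_depth` argument are hypothetical names, and the caller decides what "restart" means. An empty queue is treated as idle rather than stuck, so a quiet engine is not restarted.

```python
import time


class ProgressWatchdog:
    """Flags a stall when the position has not advanced for `stall_after` seconds."""

    def __init__(self, stall_after=60.0, clock=time.monotonic):
        self.stall_after = stall_after
        self.clock = clock
        self.last_position = None
        self.last_progress_at = clock()

    def observe(self, position, queue_depth):
        """Record the latest position; return True if the engine looks stuck."""
        now = self.clock()
        if position != self.last_position or queue_depth == 0:
            # Either progress was made or there is nothing to process (idle).
            self.last_position = position
            self.last_progress_at = now
            return False
        return (now - self.last_progress_at) >= self.stall_after
```

A supervisor loop would call `observe` periodically with the current `last_processed_position` and queue depth, then log a warning and trigger the restart when it returns `True`.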

F3: Replay determinism violated (data inconsistency)

Aspect	Detail
Detection	`ContinuouslyReplayTest` fails in CI. State divergence after failover.
Effects	Followers/replicas have different state than primary. Catastrophic in cluster.
Severity	5
Likelihood	1
RPN	5
Mitigation	Property-based testing in CI (ADR-019). Code reviews enforce determinism rules (no direct `time.now()` calls, etc.).
Recovery	Halt engine. Identify divergent state. Manual reconciliation. Possible: restore from backup, replay log.
RTO	Hours to days
RPO	0 if backup recent
Runbook	This is a P0. Page on-call immediately. Investigate carefully — don't rush recovery.
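
One way to read the "no direct `time.now()`" rule is that processors take time only from the logged command, never from the wall clock. The sketch below illustrates that idea; `Command`, `apply`, and `replay` are hypothetical names and the state shape is invented for the example, not taken from the engine.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Command:
    position: int
    timestamp_ms: int  # assigned once, when the command is appended to the log
    payload: dict


def apply(state: dict, cmd: Command) -> dict:
    # Deterministic: the only time source is the timestamp stored in the log.
    new_state = dict(state)
    new_state["last_position"] = cmd.position
    if cmd.payload.get("type") == "create_timer":
        new_state["timer_due_ms"] = cmd.timestamp_ms + cmd.payload["delay_ms"]
    return new_state


def replay(log):
    # Replaying the same log always yields the same state.
    state = {}
    for cmd in log:
        state = apply(state, cmd)
    return state


log = [
    Command(1, 1_000, {"type": "start"}),
    Command(2, 2_000, {"type": "create_timer", "delay_ms": 500}),
]
print(replay(log))
```

Because `apply` never reads the clock, a replica replaying the log hours later computes the same `timer_due_ms` as the primary did.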

F4: Memory leak in engine

Aspect	Detail
Detection	Memory metric trending up over hours/days. OOM eventually.
Effects	OOM crash. Service disruption during restart.
Severity	3
Likelihood	3
RPN	9
Mitigation	Memory profiling in CI (e.g., pprof in Go). Bounded caches. Periodic process restart (cron, weekly) as defense.
Recovery	Restart. Profile to find leak. Fix code.
RTO	30s (auto-restart)
RPO	0
Runbook	If recurring: schedule weekly proactive restart until fix deployed.

Database failures

F5: Postgres primary failure

Aspect	Detail
Detection	Patroni health checks fail. Replicas notice.
Effects	All writes blocked. Reads may work on replicas.
Severity	4
Likelihood	2
RPN	8
Mitigation	Patroni automatic failover (ADR-020). Replicas continuously syncing.
Recovery	Patroni promotes replica → ~30s downtime. PgBouncer reconnects to new primary.
RTO	30s
RPO	~5s (WAL streaming lag)
Runbook	1. Verify Patroni promoted replica. 2. Check failed primary cause (disk full, hardware?). 3. Re-introduce as replica after fix.

F6: Postgres replication lag high

Aspect	Detail
Detection	Replication lag in bytes, derived from `pg_stat_replication` (`pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn)`), exceeds threshold
Effects	Read replicas serve stale data. Failover would lose data.
Severity	3
Likelihood	3
RPN	9
Mitigation	Alert if lag > 100 MB. Tune network. Speed up WAL replay on the replica (faster I/O; `recovery_prefetch` on PostgreSQL 15+), since replay is single-process.
Recovery	Investigate cause: network slow? Replica overloaded? Add more replicas?
RTO	N/A (degraded)
RPO	Variable based on lag
Runbook	Alert: `pg_replication_lag_bytes > 100000000`. Investigate: replica CPU/IO bottleneck.

F7: Database disk full

Aspect	Detail
Detection	Disk usage > 90%. Above 95%, writes can start failing; if the WAL volume fills, Postgres panics and shuts down.
Effects	Writes fail. Engine cannot persist commands. Service down.
Severity	5
Likelihood	2
RPN	10
Mitigation	Disk usage monitoring (alert at 80%). Auto-extend cloud volumes. Retention policy enforced (per command log compaction).
Recovery	1. Extend disk (if cloud, immediate). 2. Drop old partitions (command log compaction). 3. Archive to S3.
RTO	Minutes to hours
RPO	0
Runbook	URGENT: alert at 80%, P1 at 90%, P0 at 95%. Auto-extend if possible.
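
The alert thresholds from the F7 runbook (80%, 90%, 95%) can be expressed as a small check. This is a sketch: `disk_alert_level` and `check_path` are illustrative names, and in practice the path would be the Postgres data volume rather than the current directory.

```python
import shutil


def disk_alert_level(used_fraction):
    """Map disk usage (0.0-1.0) to the alert levels from the F7 runbook."""
    if used_fraction >= 0.95:
        return "P0"
    if used_fraction >= 0.90:
        return "P1"
    if used_fraction >= 0.80:
        return "alert"
    return "ok"


def check_path(path="."):
    # shutil.disk_usage returns (total, used, free) in bytes for the volume.
    usage = shutil.disk_usage(path)
    return disk_alert_level(usage.used / usage.total)


print(check_path("."))
```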

F8: Database corruption

Aspect	Detail
Detection	Postgres data checksum failures (requires `--data-checksums`). Queries return errors.
Effects	Some data unreadable. May be silent until accessed.
Severity	5
Likelihood	1
RPN	5
Mitigation	Enable `--data-checksums` at initdb. Hardware ECC RAM. RAID for disks. Regular, verified backups.
Recovery	Restore from the last known good backup (PITR with pgBackRest). Data written after the restore point is lost.
RTO	Hours
RPO	Up to backup interval (typically 5-15 min with WAL streaming)
Runbook	P0. Page DBA. Halt writes (read-only mode). Restore from backup. Notify affected tenants.

F9: Connection pool exhausted

Aspect	Detail
Detection	New connections rejected. App errors.
Effects	New requests fail. Existing connections continue.
Severity	3
Likelihood	3
RPN	9
Mitigation	PgBouncer with proper sizing (`pool_size`). Application connection pool limits.
Recovery	Identify what is holding connections. Kill long queries. Restart engine if needed.
RTO	Minutes
RPO	0
Runbook	1. `SELECT state, count(*) FROM pg_stat_activity GROUP BY state`. 2. Kill long-running queries. 3. Investigate why connections leak.

F10: Postgres query degradation (slow queries)

Aspect	Detail
Detection	`pg_stat_statements` shows high `mean_exec_time` (`mean_time` before PostgreSQL 13) for common queries. Tail latency increases.
Effects	High latency. Queue depth grows. Potential timeout cascade.
Severity	3
Likelihood	4
RPN	12
Mitigation	Indexes from day 1 (per schema designs). pg_stat_statements monitoring. Regular ANALYZE. Vacuum tuning.
Recovery	Identify slow query. Add missing index. Rewrite query. Increase resources.
RTO	Hours
RPO	0
Runbook	Run the postgres monitoring queries weekly. Review and optimize the top slow queries.

Network failures

F11: Network partition between engine and DB

Aspect	Detail
Detection	Connection failures spike.
Effects	Engine cannot process. Health endpoint may fail.
Severity	4
Likelihood	2
RPN	8
Mitigation	Retry with backoff in DB client. Circuit breaker. Multiple AZs.
Recovery	Network restoration. Engine resumes processing.
RTO	Minutes
RPO	0
Runbook	Verify infrastructure (cloud provider status, network policies).
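
The "retry with backoff" mitigation for F11 might look like the sketch below, which uses exponential backoff with full jitter. `retry_with_backoff` and its parameters are illustrative, not the MVP's DB client API; the retryable exception type is an assumption.

```python
import random
import time


def retry_with_backoff(op, attempts=5, base_delay=0.1, max_delay=5.0,
                       retry_on=(ConnectionError,), sleep=time.sleep):
    """Call op(); on a retryable error, wait up to base_delay * 2**n and retry."""
    for attempt in range(attempts):
        try:
            return op()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the error to the caller
            delay = min(max_delay, base_delay * (2 ** attempt))
            # Full jitter keeps many clients from retrying in lockstep.
            sleep(random.uniform(0, delay))
```

Bounding `attempts` matters here: unbounded retries during a long partition would hold requests open and add to the queue growth described under cascading failures.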

F12: Network partition between engine instances (Phase 2+)

Aspect	Detail
Detection	Leader election thrashing. Multiple "leaders" in logs.
Effects	Potential split-brain. State inconsistency risk.
Severity	5
Likelihood	1
RPN	5
Mitigation	Leader election via Postgres advisory locks, with the DB primary as the single source of truth. Odd cluster size to maintain quorum.
Recovery	Network healing. One leader emerges. Investigate split-brain effects in audit log.
RTO	Minutes
RPO	0 (Postgres serializes)
Runbook	Verify only one engine is leader (`SELECT pid, granted FROM pg_locks WHERE locktype = 'advisory'`).

Worker failures

F13: Worker crashes mid-job

Aspect	Detail
Detection	Job timeout expires without completion.
Effects	Job re-distributed to another worker. Possible duplicate work (per ADR-007).
Severity	2
Likelihood	4
RPN	8
Mitigation	Worker idempotency required (ADR-007). Activation versioning. Reasonable timeouts.
Recovery	Engine auto-reactivates job. Other worker picks up.
RTO	Equal to job timeout (~30s)
RPO	0
Runbook	Investigate worker crash cause if frequent.

F14: Worker hangs indefinitely

Aspect	Detail
Detection	Job timeout expires. Worker still "running" but unresponsive.
Effects	Same as F13 + worker resource leak.
Severity	3
Likelihood	3
RPN	9
Mitigation	Worker SDK enforces timeouts (kill async tasks after N seconds). Health checks for worker process.
Recovery	Restart worker. Job continues elsewhere.
RTO	Job timeout
RPO	0
Runbook	Investigate worker hang pattern. May indicate external API issue.
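
For async workers, the SDK-enforced timeout from F14 could be built on `asyncio.wait_for`, which cancels the handler when the deadline passes. The sketch below assumes a hypothetical `run_job_with_timeout` wrapper and handler signature; the real worker SDK may differ.

```python
import asyncio


async def run_job_with_timeout(handler, job, timeout_s):
    """Run handler(job); cancel it and report a timeout if it exceeds timeout_s."""
    try:
        result = await asyncio.wait_for(handler(job), timeout=timeout_s)
        return ("completed", result)
    except asyncio.TimeoutError:
        # wait_for has already cancelled the handler task at this point.
        return ("timed_out", None)


async def fast_handler(job):
    return job * 2


async def hung_handler(job):
    await asyncio.sleep(10)  # simulates a call that never returns in time


print(asyncio.run(run_job_with_timeout(fast_handler, 21, 1.0)))
print(asyncio.run(run_job_with_timeout(hung_handler, 1, 0.05)))
```

Cancellation only takes effect at an `await` point, so a handler blocked in synchronous code is not interrupted; that case still needs the process-level health check.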

F15: Worker not idempotent → duplicate side effects

Aspect	Detail
Detection	User reports double charges, duplicate emails. Audit log shows duplicate completions.
Effects	Real-world impact (financial, customer experience).
Severity	4
Likelihood	3
RPN	12
Mitigation	Worker idempotency mandatory (ADR-007). Code review checks. Documentation.
Recovery	Identify affected operations. Manual remediation (refund, apology). Fix worker code.
RTO	Hours to days
RPO	N/A
Runbook	Audit duplicate processing patterns in worker logs. Implement idempotency immediately.
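
A common way to make a worker idempotent is to key each side effect on a stable identifier and skip keys that were already processed. The sketch below shows the idea with a hypothetical `IdempotentHandler` and an in-memory store; it is not the ADR-007 design. A real worker needs a durable store, or should pass the key to the downstream API, because a crash between the side effect and the store write would still allow a duplicate.

```python
class IdempotentHandler:
    """Wraps a side-effecting handler so each idempotency key runs at most once."""

    def __init__(self, handler, store=None):
        self.handler = handler
        # In-memory store for illustration only; assumed durable in production.
        self.results = {} if store is None else store

    def handle(self, key, payload):
        if key in self.results:
            # Duplicate delivery: return the recorded result, no new side effect.
            return self.results[key]
        result = self.handler(payload)
        self.results[key] = result
        return result


charges = []


def charge_card(payload):
    charges.append(payload["amount"])
    return f"charged {payload['amount']}"


worker = IdempotentHandler(charge_card)
print(worker.handle("job-42", {"amount": 100}))
print(worker.handle("job-42", {"amount": 100}))  # re-delivered after a timeout
print(len(charges))
```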

BPMN edge cases

F16: Malformed BPMN deployment

Aspect	Detail
Detection	Deployment endpoint returns 400 with details.
Effects	User cannot deploy. No production impact.
Severity	1
Likelihood	3
RPN	3
Mitigation	Validation (XSD + semantic + MVP-specific). Helpful error messages.
Recovery	User fixes BPMN and retries.
RTO	N/A
RPO	N/A
Runbook	N/A (user issue).

F17: BPMN with unsupported elements

Aspect	Detail
Detection	Deployment rejected with element list.
Effects	User cannot deploy that BPMN.
Severity	2
Likelihood	4
RPN	8
Mitigation	Documented supported subset. Conversion tool from Camunda BPMN.
Recovery	User modifies BPMN to use supported elements.
RTO	N/A
RPO	N/A
Runbook	Direct user to supported elements docs.

F18: FEEL/CEL expression evaluation fails

Aspect	Detail
Detection	Incident created (EXTRACT_VALUE_ERROR).
Effects	Process instance stuck pending incident resolution.
Severity	2
Likelihood	4
RPN	8
Mitigation	Expression validation at deploy time. Helpful error messages including variable context.
Recovery	Admin resolves incident: fix variable values, retry. Or cancel instance.
RTO	Minutes (manual)
RPO	N/A
Runbook	The Operate-equivalent UI shows incident details. Admin reviews and resolves.

F19: Infinite loop in BPMN

Aspect	Detail
Detection	Per-instance event count > threshold. Memory/storage growth.
Effects	One instance consumes huge resources. Other instances OK.
Severity	3
Likelihood	2
RPN	6
Mitigation	Detection: > 10K events per instance = warning, > 100K = auto-suspend.
Recovery	Operator cancels infinite instance. Fix BPMN.
RTO	Manual (hours)
RPO	N/A
Runbook	Alert: `process_instance_events > 100000`. Investigate which instance. Cancel.
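
The per-instance thresholds for F19 (warn above 10K events, auto-suspend above 100K) can be sketched as a counter keyed by instance. `LoopGuard` and the returned action strings are illustrative; how the engine actually suspends an instance is not specified here.

```python
from collections import Counter


class LoopGuard:
    """Counts events per process instance and applies the F19 thresholds."""

    def __init__(self, warn_at=10_000, suspend_at=100_000):
        self.warn_at = warn_at
        self.suspend_at = suspend_at
        self.counts = Counter()
        self.suspended = set()

    def record_event(self, instance_id):
        if instance_id in self.suspended:
            return "suspended"  # already suspended: no further events counted
        self.counts[instance_id] += 1
        n = self.counts[instance_id]
        if n > self.suspend_at:
            self.suspended.add(instance_id)
            return "suspend"
        if n > self.warn_at:
            return "warn"
        return "ok"
```

Only the offending instance is suspended; counters for other instances are unaffected, which matches the "other instances OK" blast radius.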

Clock issues

F20: Clock skew between engines

Aspect	Detail
Detection	Timer fire times inconsistent. Audit logs show non-monotonic timestamps.
Effects	Timers fire at the wrong time. Replay determinism issues (a timer may fire before its due time).
Severity	4
Likelihood	2
RPN	8
Mitigation	NTP/chrony on all hosts. Use DB time (`NOW()`) as authoritative when possible. Clock alert if > 1s drift.
Recovery	Force NTP sync. Investigate hardware issues.
RTO	Minutes
RPO	0
Runbook	Monitor `node_timex_offset_seconds`. Alert > 1s.

F21: Clock jumps backward

Aspect	Detail
Detection	Timestamps in DB go backwards. Engine may panic.
Effects	Replay non-deterministic. Multiple weird bugs.
Severity	5
Likelihood	1
RPN	5
Mitigation	Monotonic clock used for durations (not wall clock). NTP slew mode (gradual adjustment).
Recovery	Halt engine. Fix clock source. Investigate damage.
RTO	Hours
RPO	0
Runbook	P0. This is dangerous. Halt and investigate.
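
The "monotonic clock for durations" mitigation means elapsed time is measured with a clock that cannot go backward, while wall-clock time is kept for timestamps only. A minimal sketch, with `Stopwatch` and `has_expired` as illustrative names:

```python
import time


class Stopwatch:
    """Measures elapsed time with the monotonic clock, which never goes backward."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._start = clock()

    def elapsed(self):
        return self._clock() - self._start


def has_expired(started, timeout_s):
    # Unaffected by NTP steps or manual changes to the wall clock.
    return started.elapsed() >= timeout_s


sw = Stopwatch()
print(has_expired(sw, 30.0))
```

`time.monotonic()` values are only meaningful within one process, so they suit in-process timeouts but cannot be persisted or compared across engines.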

Cascading failures

F22: Slow worker → timer fire storm → engine overload

Chain:

1. Worker pool processes slowly (overloaded external API)
2. Jobs time out → re-activated
3. Timers fire on retry logic → more commands
4. Engine queue grows → backpressure trips
5. New legitimate work rejected

Mitigation: separate worker pools per job type. Per-job-type rate limiting. Circuit breaker for external services in workers.
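
The circuit breaker mentioned above can be sketched as follows: after a number of consecutive failures the breaker opens and calls fail fast, then one trial call is allowed once a cool-down has passed. `CircuitBreaker`, `CircuitOpenError`, and the default thresholds are assumptions for illustration.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is rejected without running."""


class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_after=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, op):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("circuit open; failing fast")
            # Cool-down elapsed: half-open, let one trial call through.
        try:
            result = op()
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Failing fast stops the worker from holding jobs until they time out, which is what feeds the re-activation loop in steps 2 and 3 of the chain.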

F23: DB slow → engine queue grows → OOM

Chain:

1. DB query degrades (missing index)
2. Engine processing slows down
3. Command queue grows in memory
4. Memory pressure → swap → slower
5. OOM → restart → backlog → repeat

Mitigation: bounded queue (per backpressure rest strategy). Aggressive DB query monitoring. Alerts on queue depth.
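
A bounded queue breaks the chain at step 3 by rejecting new commands instead of growing without limit. The sketch below uses the standard library `queue.Queue` with a maximum size; `BoundedCommandQueue`, `Backpressure`, and the default depth are illustrative and not the backpressure strategy's actual interface.

```python
import queue


class Backpressure(Exception):
    """Raised when the queue is full; an API layer could map this to HTTP 429 or 503."""


class BoundedCommandQueue:
    def __init__(self, max_depth=1000):
        self._q = queue.Queue(maxsize=max_depth)

    def submit(self, command):
        try:
            self._q.put_nowait(command)  # never blocks; raises queue.Full at capacity
        except queue.Full:
            raise Backpressure(f"queue full ({self._q.maxsize} commands)") from None

    def next_command(self):
        return self._q.get_nowait()

    def depth(self):
        return self._q.qsize()
```

`depth()` is the value to export for the queue-depth alert, so the alert fires before the limit is reached rather than after rejections start.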

F24: Cache invalidation storm

Chain:

1. Multiple deploys in short time
2. Each invalidates process_definition cache
3. Engine re-parses repeatedly
4. CPU saturated
5. All processing slows

Mitigation: rate limit deploys (1 per second per tenant). Cache also has TTL (not just invalidation-based).
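
The "1 per second per tenant" deploy limit can be sketched as a minimum-interval limiter keyed by tenant. `DeployRateLimiter` is a hypothetical name, and the state is kept in process memory, which assumes a single engine instance.

```python
import time


class DeployRateLimiter:
    """Allows at most one deploy per tenant every `min_interval_s` seconds."""

    def __init__(self, min_interval_s=1.0, clock=time.monotonic):
        self.min_interval_s = min_interval_s
        self.clock = clock
        self._last = {}  # tenant -> time of the last accepted deploy

    def allow(self, tenant):
        now = self.clock()
        last = self._last.get(tenant)
        if last is not None and now - last < self.min_interval_s:
            return False  # too soon: reject without updating the timestamp
        self._last[tenant] = now
        return True
```

Rejected attempts do not reset the timer, so a tenant that retries in a tight loop is still admitted once the interval has passed.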

Recovery procedures

Common procedures

Drain mode (graceful shutdown)

# Stop accepting new commands
mvp-cli engine drain --instance=engine-1

# Wait for queue to drain
mvp-cli engine wait-for-quiet --instance=engine-1 --timeout=5m

# Now safe to restart/maintenance
systemctl stop engine

Catastrophic recovery (PITR)

# 1. Halt all engines
mvp-cli cluster halt

# 2. Restore Postgres to PITR
pgbackrest --stanza=main --type=time --target="2025-05-14 10:00:00" restore

# 3. Replay command_log from logs after restore point (if any survived)
mvp-cli engine replay --from-position=$LAST_COMMITTED

# 4. Resume normal operation
mvp-cli cluster start

Tenant-specific issue

# Suspend tenant (no new operations accepted)
mvp-cli tenant suspend acme --reason="investigation"

# Investigate via audit log
mvp-cli audit query --tenant=acme --since="2025-05-14 09:00" --action="*"

# Resolve issue
# ...

# Resume
mvp-cli tenant resume acme

Priority matrix

Ordering by RPN (highest priority first):

F#	Failure	RPN	Phase
F1	Engine crash	12	Mitigate Phase 0
F10	Slow queries	12	Mitigate Phase 0
F15	Worker non-idempotent	12	Education ongoing
F7	Disk full	10	Mitigate Phase 0
F4	Memory leak	9	Mitigate Phase 1
F6	Replication lag	9	Mitigate Phase 1
F9	Connection exhausted	9	Mitigate Phase 0
F14	Worker hangs	9	Mitigate Phase 0
F2	Engine stuck	8	Mitigate Phase 0
F5	DB primary fails	8	Phase 1 (Patroni)
F11	Network partition DB	8	Phase 1
F13	Worker crash	8	Mitigate Phase 0
F17	Unsupported BPMN	8	Education
F18	Expression fails	8	Mitigate Phase 0
F20	Clock skew	8	Mitigate Phase 0
F19	Infinite loop	6	Phase 1
F8	DB corruption	5	Mitigate Phase 0
F3	Replay determinism	5	Test Phase 0
F12	Engine split-brain	5	Phase 2 design
F21	Clock jumps back	5	Mitigate Phase 0
F16	Malformed BPMN	3	Documentation

Resulting actions

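Phase 0 mitigations (must have), per the Phase column of the priority matrix: F1, F2, F3 (testing), F7, F8, F9, F10, F13, F14, F18, F20, F21.

Phase 1 additions: F4, F5 (Patroni), F6, F11, F19.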

Links

security threat model — Security failures
adr 019 replay determinism invariant — F3 mitigation
adr 007 at least once idempotent workers — F13, F15 mitigations
adr 020 patroni postgres ha — F5 mitigation
postgres monitoring — F10 monitoring
backpressure rest strategy — F23 mitigation