Failure Mode Analysis
Systematic analysis of the MVP's failure modes, component by component. For each failure: detection mechanism, blast radius, RTO/RPO objectives, mitigation strategy, and recovery procedure. Covers engine crashes, DB failures, network partitions, worker failures, BPMN edge cases, data corruption, and clock issues. The resulting runbook entries are reusable as a baseline for SRE on-call.
FMEA methodology
For each component:
1. Identify failure modes (what can fail)
2. Effects (what the impact is)
3. Detection (how it is discovered)
4. Severity (1-5 scale)
5. Likelihood (1-5 scale)
6. Mitigation (how to prevent or reduce it)
7. Recovery (what to do when it occurs)
Severity scale
| Sev | Description | Example |
| --- | --- | --- |
| 5 | Catastrophic data loss | DB corruption, no backup |
| 4 | Major service disruption | Engine down > 1 hour |
| 3 | Degraded service | High latency, partial outage |
| 2 | Minor user-facing issue | Single feature unavailable |
| 1 | Cosmetic | Log noise |
Likelihood scale
| Lik | Description | Frequency |
| --- | --- | --- |
| 5 | Almost certain | Daily |
| 4 | Likely | Weekly |
| 3 | Possible | Monthly |
| 2 | Unlikely | Yearly |
| 1 | Rare | Multi-year |
Risk Priority Number (RPN) = Severity × Likelihood. Address high-RPN first.
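As a minimal sketch of the RPN calculation, the snippet below scores a few of the failure modes from this document and sorts them. The `FailureMode` class and `prioritize` helper are hypothetical names for illustration, and breaking RPN ties by severity is an assumption, not something this document specifies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureMode:
    fid: str
    name: str
    severity: int    # 1-5, per the severity scale
    likelihood: int  # 1-5, per the likelihood scale

    @property
    def rpn(self) -> int:
        # Risk Priority Number = Severity x Likelihood
        return self.severity * self.likelihood


def prioritize(modes):
    """Sort by RPN descending; ties broken by severity descending (assumption)."""
    return sorted(modes, key=lambda m: (-m.rpn, -m.severity))


modes = [
    FailureMode("F16", "Malformed BPMN", 1, 3),
    FailureMode("F3", "Replay determinism", 5, 1),
    FailureMode("F1", "Engine crash", 4, 3),
    FailureMode("F7", "Disk full", 5, 2),
]

for m in prioritize(modes):
    print(m.fid, m.name, m.rpn)
```

A low RPN does not mean a failure can be ignored: F3 scores 5 only because it is rare, while its severity is the maximum.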
Engine failures
F1: Engine process crash (OOM, panic, segfault)
| Aspect | Detail |
| --- | --- |
| Detection | Health endpoint 503, process exit |
| Effects | All in-flight requests fail. New requests fail until restart. |
| Severity | 4 |
| Likelihood | 3 |
| RPN | 12 |
| Mitigation | Process supervisor (systemd, Docker restart policy, Kubernetes) auto-restart. Memory limits enforced. |
| Recovery | Auto-restart in 5-30s. Recovery scans command_log to resume processing (per concepts/api-engine-serialization). Pending timers are reloaded (per concepts/timer-recovery-postgres). |
| RTO | 30s |
| RPO | 0 (commands in the log persist) |
| Runbook | 1. Check logs for crash cause. 2. Verify auto-restart succeeded. 3. If repeated: investigate OOM/memory leak. 4. Alert if down > 5 min. |
F2: Engine processing loop stuck
| Aspect | Detail |
| --- | --- |
| Detection | last_processed_position does not advance. Queue depth growing. |
| Effects | New commands queue up without being processed. Eventually backpressure trips. |
| Severity | 4 |
| Likelihood | 2 |
| RPN | 8 |
| Mitigation | Watchdog timer: if no progress for 60s, log warning and restart. Bounded operations (timeouts on DB calls, etc.). |
| Recovery | Restart engine. Use traces to identify which processor got stuck. |
| RTO | 5 min |
| RPO | 0 |
| Runbook | 1. Check engine.processing.position metric. 2. If stuck > 60s, restart. 3. Pull stack trace before restart for diagnosis. |
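The watchdog rule for F2 (no progress for 60s) can be sketched as follows. `ProgressWatchdog` is a hypothetical helper, not the engine's actual implementation; the injectable `clock` parameter is an assumption added so the logic can be tested without waiting.

```python
import time


class ProgressWatchdog:
    """Flags a stall when the processed position stops advancing."""

    def __init__(self, stall_seconds=60.0, clock=time.monotonic):
        self._stall_seconds = stall_seconds
        self._clock = clock  # monotonic, so wall-clock jumps cannot fake a stall
        self._last_position = None
        self._last_change = clock()

    def observe(self, position):
        """Record the latest processed position; return True if stalled."""
        now = self._clock()
        if position != self._last_position:
            # Progress was made: remember it and reset the stall timer.
            self._last_position = position
            self._last_change = now
            return False
        return (now - self._last_change) >= self._stall_seconds
```

An idle engine with an empty queue also shows no position change, so a caller would likely only act on a stall while queue depth is above zero.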
F3: Replay determinism violated (data inconsistency)
| Aspect | Detail |
| --- | --- |
| Detection | ContinuouslyReplayTest fails in CI. State divergence after failover. |
| Effects | Followers/replicas have different state than primary. Catastrophic in cluster. |
| Severity | 5 |
| Likelihood | 1 |
| RPN | 5 |
| Mitigation | Property-based testing in CI (ADR-019). Code reviews enforce determinism rules (no time.now() direct, etc.). |
| Recovery | Halt engine. Identify divergent state. Manual reconciliation. Possible: restore from backup, replay log. |
| RTO | Hours to days |
| RPO | 0 if backup recent |
| Runbook | This is a P0. Page on-call immediately. Investigate carefully. Don't rush recovery. |
F4: Memory leak in engine
| Aspect | Detail |
| --- | --- |
| Detection | Memory metric trending up over hours/days. OOM eventually. |
| Effects | OOM crash. Service disruption during restart. |
| Severity | 3 |
| Likelihood | 3 |
| RPN | 9 |
| Mitigation | Memory profiling in CI (e.g., pprof in Go). Bounded caches. Periodic process restart (cron, weekly) as defense. |
| Recovery | Restart. Profile to find leak. Fix code. |
| RTO | 30s (auto-restart) |
| RPO | 0 |
| Runbook | If recurring: schedule weekly proactive restart until fix deployed. |
Database failures
F5: Postgres primary failure
| Aspect | Detail |
| --- | --- |
| Detection | Patroni health checks fail. Replicas notice. |
| Effects | All writes blocked. Reads may work on replicas. |
| Severity | 4 |
| Likelihood | 2 |
| RPN | 8 |
| Mitigation | Patroni automatic failover (ADR-020). Replicas continuously syncing. |
| Recovery | Patroni promotes replica → ~30s downtime. PgBouncer reconnects to new primary. |
| RTO | 30s |
| RPO | ~5s (WAL streaming lag) |
| Runbook | 1. Verify Patroni promoted replica. 2. Check failed primary cause (disk full, hardware?). 3. Re-introduce as replica after fix. |
F6: Postgres replication lag high
| Aspect | Detail |
| --- | --- |
| Detection | Replication lag in bytes exceeds threshold, e.g. pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) computed from pg_stat_replication on the primary. |
| Effects | Read replicas serve stale data. Failover would lose data. |
| Severity | 3 |
| Likelihood | 3 |
| RPN | 9 |
| Mitigation | Alert if lag > 100 MB. Tune network. Speed up WAL replay on the replica (faster I/O; recovery_prefetch on PostgreSQL 15+). |
| Recovery | Investigate cause: network slow? Replica overloaded? Add more replicas? |
| RTO | N/A (degraded) |
| RPO | Variable based on lag |
| Runbook | Alert: pg_replication_lag_bytes > 100000000. Investigate: replica CPU/IO bottleneck. |
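To make the 100 MB lag threshold concrete, the sketch below converts two Postgres LSNs (the textual `hi/lo` hexadecimal form) into byte positions and compares the difference, which is the same arithmetic `pg_wal_lsn_diff` performs server-side. The function names and the idea of computing this client-side are assumptions for illustration.

```python
LAG_ALERT_BYTES = 100_000_000  # 100 MB threshold from the F6 runbook


def lsn_to_bytes(lsn: str) -> int:
    """Convert a textual LSN such as '0/16B3748' to an absolute byte position."""
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) + int(lo, 16)


def replication_lag_bytes(primary_lsn: str, replica_replay_lsn: str) -> int:
    """Bytes of WAL the replica has not yet replayed (never negative)."""
    return max(0, lsn_to_bytes(primary_lsn) - lsn_to_bytes(replica_replay_lsn))


def lag_exceeds_threshold(primary_lsn, replica_replay_lsn,
                          threshold=LAG_ALERT_BYTES) -> bool:
    return replication_lag_bytes(primary_lsn, replica_replay_lsn) > threshold
```

For example, `lag_exceeds_threshold("1/0", "0/0")` is true, because one full 4 GiB LSN segment separates the two positions.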
F7: Database disk full
| Aspect | Detail |
| --- | --- |
| Detection | Disk usage > 90%. At 95%+ Postgres risks being unable to write WAL, at which point it PANICs and stops accepting writes. |
| Effects | Writes fail. Engine cannot persist commands. Service down. |
| Severity | 5 |
| Likelihood | 2 |
| RPN | 10 |
| Mitigation | Disk usage monitoring (alert at 80%). Auto-extend cloud volumes. Retention policy enforced (per concepts/command-log-compaction). |
| Recovery | 1. Extend disk (if cloud, immediate). 2. Drop old partitions (concepts/command-log-compaction). 3. Archive to S3. |
| RTO | Minutes to hours |
| RPO | 0 |
| Runbook | URGENT: alert at 80%, P1 at 90%, P0 at 95%. Auto-extend if possible. |
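The escalation thresholds in the F7 runbook can be expressed as a small classifier. This is a sketch under stated assumptions: `disk_alert_level` and `check_path` are hypothetical helpers, and a real deployment would more likely encode these thresholds as alerting rules in the monitoring system.

```python
import shutil


def disk_alert_level(used_fraction: float) -> str:
    """Map disk usage (0.0-1.0) to the runbook's escalation level."""
    if used_fraction >= 0.95:
        return "P0"
    if used_fraction >= 0.90:
        return "P1"
    if used_fraction >= 0.80:
        return "alert"
    return "ok"


def check_path(path: str) -> str:
    """Classify the filesystem that contains `path`."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return disk_alert_level(usage.used / usage.total)
```

Note that `used / total` can differ slightly from the percentage `df` reports, because some filesystems reserve blocks for root.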
F8: Database corruption
| Aspect | Detail |
| --- | --- |
| Detection | Postgres data checksum errors. Queries return errors. |
| Effects | Some data unreadable. May be silent until accessed. |
| Severity | 5 |
| Likelihood | 1 |
| RPN | 5 |
| Mitigation | Enable data checksums at initdb (--data-checksums). Hardware ECC RAM. RAID for disks. Regular backups, with restores verified. |
| Recovery | Restore from last known good backup (PITR with pgBackRest). Data written after the restore point is lost. |
| RTO | Hours |
| RPO | Up to backup interval (typically 5-15 min with WAL archiving) |
| Runbook | P0. Page DBA. Halt writes (read-only mode). Restore from backup. Notify affected tenants. |
F9: Connection pool exhausted
| Aspect | Detail |
| --- | --- |
| Detection | New connections rejected. App errors. |
| Effects | New requests fail. Existing connections continue. |
| Severity | 3 |
| Likelihood | 3 |
| RPN | 9 |
| Mitigation | PgBouncer with proper sizing (pool_size). Application connection pool limits. |
| Recovery | Identify what is holding connections. Kill long queries. Restart engine if needed. |
| RTO | Minutes |
| RPO | 0 |
| Runbook | 1. SELECT state, count(*) FROM pg_stat_activity GROUP BY state. 2. Kill long-running queries. 3. Investigate why connections leak. |
F10: Postgres query degradation (slow queries)
| Aspect | Detail |
| --- | --- |
| Detection | pg_stat_statements shows high mean_exec_time (mean_time before PostgreSQL 13) for common queries. Tail latency increases. |
| Effects | High latency. Queue depth grows. Potential timeout cascade. |
| Severity | 3 |
| Likelihood | 4 |
| RPN | 12 |
| Mitigation | Indexes from day 1 (per schema designs). pg_stat_statements monitoring. Regular ANALYZE. Vacuum tuning. |
| Recovery | Identify slow query. Add missing index. Rewrite query. Increase resources. |
| RTO | Hours |
| RPO | 0 |
| Runbook | Run the concepts/postgres-monitoring queries weekly. Review and optimize the top slow queries. |
Network failures
F11: Network partition between engine and DB
| Aspect | Detail |
| --- | --- |
| Detection | Connection failures spike. |
| Effects | Engine cannot process. Health endpoint may fail. |
| Severity | 4 |
| Likelihood | 2 |
| RPN | 8 |
| Mitigation | Retry with backoff in DB client. Circuit breaker. Multiple AZs. |
| Recovery | Network restoration. Engine resumes processing. |
| RTO | Minutes |
| RPO | 0 |
| Runbook | Verify infrastructure (cloud provider status, network policies). |
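The "retry with backoff" mitigation for F11 might look like the sketch below: exponential backoff with full jitter around a DB call. `retry_with_backoff` and its parameters are hypothetical; retrying on `ConnectionError` is an assumption, and a real DB client would retry on its driver's own connection exception types.

```python
import random
import time


def retry_with_backoff(op, attempts=5, base_delay=0.1, max_delay=5.0,
                       sleep=time.sleep, retry_on=(ConnectionError,)):
    """Call op(); on a retryable error, wait and try again up to `attempts` times."""
    for attempt in range(attempts):
        try:
            return op()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error to the caller
            # Exponential backoff capped at max_delay, with full jitter so that
            # many engines reconnecting at once do not synchronize their retries.
            delay = min(max_delay, base_delay * (2 ** attempt))
            sleep(random.uniform(0, delay))
```

Retries only help with short partitions. For longer outages the bounded attempt count makes the call fail fast enough for the health endpoint to report the problem.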
F12: Network partition between engine instances (Phase 2+)
| Aspect | Detail |
| --- | --- |
| Detection | Leader election thrashing. Multiple "leaders" in logs. |
| Effects | Potential split-brain. State inconsistency risk. |
| Severity | 5 |
| Likelihood | 1 |
| RPN | 5 |
| Mitigation | Leader election through Postgres advisory locks on the DB primary (single source of truth). Odd cluster size to maintain quorum. |
| Recovery | Network healing. One leader emerges. Investigate split-brain effects in audit log. |
| RTO | Minutes |
| RPO | 0 (Postgres serializes) |
| Runbook | Verify only one engine is leader (inspect held advisory locks: SELECT * FROM pg_locks WHERE locktype = 'advisory'). |
Worker failures
F13: Worker crashes mid-job
| Aspect | Detail |
| --- | --- |
| Detection | Job timeout expires without completion. |
| Effects | Job re-distributed to another worker. Possible duplicate work (per ADR-007). |
| Severity | 2 |
| Likelihood | 4 |
| RPN | 8 |
| Mitigation | Worker idempotency required (ADR-007). Activation versioning. Reasonable timeouts. |
| Recovery | Engine auto-reactivates job. Other worker picks up. |
| RTO | Equal to job timeout (~30s) |
| RPO | 0 |
| Runbook | Investigate worker crash cause if frequent. |
F14: Worker hangs indefinitely
| Aspect | Detail |
| --- | --- |
| Detection | Job timeout expires. Worker still "running" but unresponsive. |
| Effects | Same as F13 + worker resource leak. |
| Severity | 3 |
| Likelihood | 3 |
| RPN | 9 |
| Mitigation | Worker SDK enforces timeouts (kill async tasks after N seconds). Health checks for worker process. |
| Recovery | Restart worker. Job continues elsewhere. |
| RTO | Job timeout |
| RPO | 0 |
| Runbook | Investigate worker hang pattern. May indicate external API issue. |
F15: Worker not idempotent → duplicate side effects
| Aspect | Detail |
| --- | --- |
| Detection | User reports double charges, duplicate emails. Audit log shows duplicate completions. |
| Effects | Real-world impact (financial, customer experience). |
| Severity | 4 |
| Likelihood | 3 |
| RPN | 12 |
| Mitigation | Worker idempotency mandatory (ADR-007). Code review checks. Documentation. |
| Recovery | Identify affected operations. Manual remediation (refund, apology). Fix worker code. |
| RTO | Hours to days |
| RPO | N/A |
| Runbook | Audit duplicate processing patterns in worker logs. Implement idempotency immediately. |
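One common way to meet the idempotency requirement is to key each side effect on a stable identifier and skip work already recorded for that key. The sketch below is illustrative only: `IdempotentHandler` is a hypothetical name, the in-memory dict stands in for what would need to be a durable store (for example a table with a unique constraint), and deriving the key from the job identity is an assumption, not something ADR-007 is known to prescribe.

```python
class IdempotentHandler:
    """Runs a side effect at most once per idempotency key."""

    def __init__(self, side_effect):
        self._side_effect = side_effect
        self._results = {}  # stand-in for a durable dedup store (assumption)

    def handle(self, idempotency_key, payload):
        if idempotency_key in self._results:
            # Re-delivered job: return the recorded result, do not repeat the effect.
            return self._results[idempotency_key]
        result = self._side_effect(payload)
        self._results[idempotency_key] = result
        return result


charges = []


def charge_card(payload):
    charges.append(payload["amount"])
    return {"charged": payload["amount"]}


handler = IdempotentHandler(charge_card)
handler.handle("job-42", {"amount": 100})
handler.handle("job-42", {"amount": 100})  # retry after a timeout: no second charge
print(len(charges))
```

A crash between performing the side effect and recording the key still leaves a duplicate window, so where the external API accepts an idempotency key, passing the same key downstream closes that gap.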
BPMN edge cases
F16: Malformed BPMN
| Aspect | Detail |
| --- | --- |
| Detection | Deployment endpoint returns 400 with details. |
| Effects | User cannot deploy. No production impact. |
| Severity | 1 |
| Likelihood | 3 |
| RPN | 3 |
| Mitigation | Validation (XSD + semantic + MVP-specific). Helpful error messages. |
| Recovery | User fixes BPMN and retries. |
| RTO | N/A |
| RPO | N/A |
| Runbook | N/A (user issue). |
F17: BPMN with unsupported elements
| Aspect | Detail |
| --- | --- |
| Detection | Deployment rejected with element list. |
| Effects | User cannot deploy that BPMN. |
| Severity | 2 |
| Likelihood | 4 |
| RPN | 8 |
| Mitigation | Documented supported subset. Conversion tool from Camunda BPMN. |
| Recovery | User modifies BPMN to use supported elements. |
| RTO | N/A |
| RPO | N/A |
| Runbook | Direct user to supported elements docs. |
F18: FEEL/CEL expression evaluation fails
| Aspect | Detail |
| --- | --- |
| Detection | Incident created (EXTRACT_VALUE_ERROR). |
| Effects | Process instance stuck pending incident resolution. |
| Severity | 2 |
| Likelihood | 4 |
| RPN | 8 |
| Mitigation | Expression validation at deploy time. Helpful error messages including variable context. |
| Recovery | Admin resolves incident: fix variable values, retry. Or cancel instance. |
| RTO | Minutes (manual) |
| RPO | N/A |
| Runbook | Operate-equivalent shows incident details. Admin reviews and resolves. |
F19: Infinite loop in BPMN
| Aspect | Detail |
| --- | --- |
| Detection | Per-instance event count > threshold. Memory/storage growth. |
| Effects | One instance consumes huge resources. Other instances OK. |
| Severity | 3 |
| Likelihood | 2 |
| RPN | 6 |
| Mitigation | Detection: > 10K events per instance = warning, > 100K = auto-suspend. |
| Recovery | Operator cancels infinite instance. Fix BPMN. |
| RTO | Manual (hours) |
| RPO | N/A |
| Runbook | Alert: process_instance_events > 100000. Investigate which instance. Cancel. |
Clock issues
F20: Clock skew between engines
| Aspect | Detail |
| --- | --- |
| Detection | Timer fire times inconsistent. Audit logs show non-monotonic timestamps. |
| Effects | Timer fires at wrong time. Replay determinism issues (an engine fires a timer earlier than instructed). |
| Severity | 4 |
| Likelihood | 2 |
| RPN | 8 |
| Mitigation | NTP/chrony on all hosts. Use DB time (NOW()) as authoritative when possible. Clock alert if > 1s drift. |
| Recovery | Force NTP sync. Investigate hardware issues. |
| RTO | Minutes |
| RPO | 0 |
| Runbook | Monitor node_timex_offset_seconds. Alert > 1s. |
F21: Clock jumps backward
| Aspect | Detail |
| --- | --- |
| Detection | Timestamps in DB go backwards. Engine may panic. |
| Effects | Replay non-deterministic. Multiple weird bugs. |
| Severity | 5 |
| Likelihood | 1 |
| RPN | 5 |
| Mitigation | Monotonic clock used for durations (not wall clock). NTP slew mode (gradual adjustment). |
| Recovery | Halt engine. Fix clock source. Investigate damage. |
| RTO | Hours |
| RPO | 0 |
| Runbook | P0. This is dangerous. Halt and investigate. |
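The "monotonic clock for durations" mitigation can be illustrated with a small deadline helper. `Deadline` is a hypothetical class for this sketch; Python's `time.monotonic()` is used because it is guaranteed never to go backwards, unlike wall-clock time, which NTP corrections can step.

```python
import time


class Deadline:
    """Tracks a timeout using a monotonic clock, immune to wall-clock jumps."""

    def __init__(self, timeout_seconds, clock=time.monotonic):
        self._clock = clock
        self._expires_at = clock() + timeout_seconds

    def remaining(self) -> float:
        """Seconds left before expiry, never negative."""
        return max(0.0, self._expires_at - self._clock())

    def expired(self) -> bool:
        return self._clock() >= self._expires_at
```

Monotonic readings are only meaningful within one process, so they suit timeouts and elapsed-time measurement. Persisted timestamps still need wall-clock or DB time.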
Cascading failures
F22: Slow worker → timer fire storm → engine overload
Chain:
1. Worker pool processes jobs slowly (external API overloaded)
2. Jobs time out → re-activated
3. Retry timers fire → more commands
4. Engine queue grows → backpressure trips
5. New legitimate work rejected
Mitigation: separate worker pools per job type. Per-job-type rate limiting. Circuit breaker for external services in workers.
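The circuit breaker named in the mitigation could be sketched as below. This is a minimal single-threaded version under stated assumptions: `CircuitBreaker`, `CircuitOpenError`, and the threshold values are hypothetical, and a worker SDK would need locking around the shared counters.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_timeout=30.0, clock=time.monotonic):
        self._failure_threshold = failure_threshold
        self._reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self._reset_timeout:
                raise CircuitOpenError("circuit open; call rejected")
            # Reset timeout elapsed: half-open, let one trial call through.
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._failures += 1
            # A failed trial call, or too many consecutive failures, (re)opens the circuit.
            if self._opened_at is not None or self._failures >= self._failure_threshold:
                self._opened_at = self._clock()
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

While the circuit is open, jobs fail immediately instead of holding a worker until the job timeout, which is what breaks step 2 of the chain above.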
F23: DB slow → engine queue grows → OOM
Chain:
1. DB query degrades (missing index)
2. Engine processing slows down
3. Commands queue grows in memory
4. Memory pressure → swap → slower
5. OOM → restart → backlog → repeat
Mitigation: bounded queue (per concepts/backpressure-rest-strategy). Aggressive DB query monitoring. Alerts on queue depth.
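A bounded queue breaks this chain at step 3 by rejecting work instead of growing memory. The sketch below uses Python's standard `queue.Queue`; `BoundedCommandQueue` is a hypothetical wrapper, and how a rejection is surfaced to clients (for example an HTTP 429 or 503) is an assumption left to concepts/backpressure-rest-strategy.

```python
import queue


class BoundedCommandQueue:
    """In-memory command queue that rejects new work once it is full."""

    def __init__(self, max_depth: int):
        self._q = queue.Queue(maxsize=max_depth)

    def submit(self, command) -> bool:
        """Enqueue without blocking; return False if the queue is full."""
        try:
            self._q.put_nowait(command)
            return True
        except queue.Full:
            return False  # caller translates this into a backpressure response

    def take(self):
        """Dequeue the oldest command; raises queue.Empty if none is waiting."""
        return self._q.get_nowait()

    def depth(self) -> int:
        return self._q.qsize()
```

The `depth()` value is the natural source for the queue-depth alert mentioned in the mitigation.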
F24: Cache invalidation storm
Chain:
1. Multiple deploys in short time
2. Each invalidates process_definition cache
3. Engine re-parses repeatedly
4. CPU saturated
5. All processing slows
Mitigation: rate limit deploys (1 per second per tenant). Cache also has TTL (not just invalidation-based).
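The per-tenant deploy limit (1 per second) can be sketched as a minimum-interval limiter. `DeployRateLimiter` is a hypothetical name, and the in-process dict is an assumption: with several engine instances the last-deploy timestamps would need to live in shared storage.

```python
import time


class DeployRateLimiter:
    """Allows at most one deploy per tenant per `min_interval` seconds."""

    def __init__(self, min_interval=1.0, clock=time.monotonic):
        self._min_interval = min_interval
        self._clock = clock
        self._last_deploy = {}  # tenant -> time of last accepted deploy

    def allow(self, tenant: str) -> bool:
        now = self._clock()
        last = self._last_deploy.get(tenant)
        if last is not None and now - last < self._min_interval:
            return False  # too soon after this tenant's previous deploy
        self._last_deploy[tenant] = now
        return True
```

Rejected deploys do not reset the interval, so a tenant that keeps retrying is still admitted once per second rather than being locked out.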
Recovery procedures
Common procedures
Drain mode (graceful shutdown)
# Stop accepting new commands
mvp-cli engine drain --instance=engine-1
# Wait for queue to drain
mvp-cli engine wait-for-quiet --instance=engine-1 --timeout=5m
# Now safe to restart/maintenance
systemctl stop engine
Catastrophic recovery (PITR)
# 1. Halt all engines
mvp-cli cluster halt
# 2. Restore Postgres to PITR
pgbackrest --stanza=main --type=time --target="2025-05-14 10:00:00" restore
# 3. Replay command_log from logs after restore point (if any survived)
mvp-cli engine replay --from-position=$LAST_COMMITTED
# 4. Resume normal operation
mvp-cli cluster start
Tenant-specific issue
# Suspend tenant (no new operations accepted)
mvp-cli tenant suspend acme --reason="investigation"
# Investigate via audit log
mvp-cli audit query --tenant=acme --since="2025-05-14 09:00" --action="*"
# Resolve issue
# ...
# Resume
mvp-cli tenant resume acme
Priority matrix
Ordering by RPN (highest priority first):
| F# | Failure | RPN | Phase |
| --- | --- | --- | --- |
| F1 | Engine crash | 12 | Mitigate Phase 0 |
| F10 | Slow queries | 12 | Mitigate Phase 0 |
| F15 | Worker non-idempotent | 12 | Education ongoing |
| F7 | Disk full | 10 | Mitigate Phase 0 |
| F4 | Memory leak | 9 | Mitigate Phase 1 |
| F6 | Replication lag | 9 | Mitigate Phase 1 |
| F9 | Connection exhausted | 9 | Mitigate Phase 0 |
| F14 | Worker hangs | 9 | Mitigate Phase 0 |
| F2 | Engine stuck | 8 | Mitigate Phase 0 |
| F5 | DB primary fails | 8 | Phase 1 (Patroni) |
| F11 | Network partition DB | 8 | Phase 1 |
| F13 | Worker crash | 8 | Mitigate Phase 0 |
| F17 | Unsupported BPMN | 8 | Education |
| F18 | Expression fails | 8 | Mitigate Phase 0 |
| F20 | Clock skew | 8 | Mitigate Phase 0 |
| F19 | Infinite loop | 6 | Phase 1 |
| F8 | DB corruption | 5 | Mitigate Phase 0 |
| F3 | Replay determinism | 5 | Test Phase 0 |
| F12 | Engine split-brain | 5 | Phase 2 design |
| F21 | Clock jumps back | 5 | Mitigate Phase 0 |
| F16 | Malformed BPMN | 3 | Documentation |
Resulting actions
Phase 0 mitigations (must have):
Phase 1 additions:
Links