Failure Mode Analysis

Systematic analysis of the MVP's failure modes, component by component. For each failure it gives the detection mechanism, blast radius, RTO/RPO objectives, mitigation strategy, and recovery procedure. It covers engine crashes, DB failures, network partitions, worker failures, BPMN edge cases, data corruption, and clock issues. The resulting runbook entries are reusable as a baseline for SRE on-call.

FMEA methodology

For each component:

  1. Identify failure modes (what can fail)
  2. Effects (what the impact is)
  3. Detection (how it is discovered)
  4. Severity (1-5 scale)
  5. Likelihood (1-5 scale)
  6. Mitigation (how to prevent or reduce it)
  7. Recovery (what to do when it occurs)

Severity scale

Sev Description Example
5 Catastrophic data loss DB corruption, no backup
4 Major service disruption Engine down > 1 hour
3 Degraded service High latency, partial outage
2 Minor user-facing issue Single feature unavailable
1 Cosmetic Log noise

Likelihood scale

Lik Description Frequency
5 Almost certain Daily
4 Likely Weekly
3 Possible Monthly
2 Unlikely Yearly
1 Rare Multi-year

Risk Priority Number (RPN) = Severity × Likelihood. Address high-RPN first.
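
As a worked example of the formula above, the following sketch computes the RPN for a few of the failure modes in this document and ranks them. The `FailureMode` class and `rank_by_rpn` helper are illustrative names, and breaking ties by severity is an assumption of this sketch, not a rule stated here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureMode:
    fid: str
    name: str
    severity: int    # 1-5, per the severity scale
    likelihood: int  # 1-5, per the likelihood scale

    @property
    def rpn(self) -> int:
        # Risk Priority Number = Severity x Likelihood
        return self.severity * self.likelihood


def rank_by_rpn(modes):
    # Highest RPN first; ties broken by higher severity (assumption).
    return sorted(modes, key=lambda m: (m.rpn, m.severity), reverse=True)


modes = [
    FailureMode("F1", "Engine crash", 4, 3),
    FailureMode("F3", "Replay determinism", 5, 1),
    FailureMode("F7", "Disk full", 5, 2),
    FailureMode("F16", "Malformed BPMN", 1, 3),
]

for m in rank_by_rpn(modes):
    print(m.fid, m.name, m.rpn)
```

With these inputs the ranking is F1 (12), F7 (10), F3 (5), F16 (3).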

Engine failures

F1: Engine process crash (OOM, panic, segfault)

Aspect Detail
Detection Health endpoint 503, process exit
Effects All in-flight requests fail. New requests fail until restart.
Severity 4
Likelihood 3
RPN 12
Mitigation Process supervisor (systemd, Docker restart policy, Kubernetes) auto-restart. Memory limits enforced.
Recovery Auto-restart in 5-30s. On startup, recovery scans command_log to resume processing (per concepts/api-engine-serialization). Pending timers are reloaded (per concepts/timer-recovery-postgres).
RTO 30s
RPO 0 (commands in the log persist)
Runbook 1. Check logs for crash cause. 2. Verify auto-restart succeeded. 3. If repeated: investigate OOM/memory leak. 4. Alert if down > 5 min.

F2: Engine processing loop stuck

Aspect Detail
Detection last_processed_position does not advance. Queue depth keeps growing.
Effects New commands queue up, no processing. Eventually backpressure trips.
Severity 4
Likelihood 2
RPN 8
Mitigation Watchdog timer: if there is no progress for 60s, log a warning and restart. Bounded operations (timeouts on DB calls, etc.).
Recovery Restart the engine. Use traces to identify which processor got stuck.
RTO 5 min
RPO 0
Runbook 1. Check engine.processing.position metric. 2. If stuck > 60s, pull a stack trace for diagnosis. 3. Restart the engine.
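
The watchdog described in F2 can be sketched as a small progress tracker. This is a minimal illustration, not the engine's implementation: `ProgressWatchdog` and `observe` are hypothetical names, and the clock is injectable only so the logic can be tested without waiting.

```python
import time


class ProgressWatchdog:
    """Flags a stall when the processed position stops advancing."""

    def __init__(self, stall_after_s=60.0, clock=time.monotonic):
        self._stall_after_s = stall_after_s
        self._clock = clock  # monotonic, so wall-clock jumps do not matter
        self._last_position = None
        self._last_progress_at = clock()

    def observe(self, position):
        """Record the current position; return True if the loop looks stuck."""
        now = self._clock()
        if position != self._last_position:
            # Progress was made: remember where and when.
            self._last_position = position
            self._last_progress_at = now
            return False
        return (now - self._last_progress_at) >= self._stall_after_s
```

A caller would poll `observe(last_processed_position)` periodically and trigger the warning and restart when it returns True. An idle engine with an empty queue also shows no progress, so in practice the check should only count as a stall while the queue depth is non-zero.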

F3: Replay determinism violated (data inconsistency)

Aspect Detail
Detection ContinuouslyReplayTest fails in CI. State divergence after failover.
Effects Followers/replicas have different state than primary. Catastrophic in cluster.
Severity 5
Likelihood 1
RPN 5
Mitigation Property-based testing in CI (ADR-019). Code reviews enforce determinism rules (no direct time.now() calls, etc.).
Recovery Halt engine. Identify divergent state. Manual reconciliation. Options: restore from backup, then replay the log.
RTO Hours to days
RPO 0 if backup recent
Runbook This is a P0. Page on-call immediately. Investigate carefully; don't rush recovery.
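
To illustrate the determinism rule behind F3, the sketch below shows a handler that takes its timestamp from the logged command instead of reading the wall clock, so replaying the same log always yields the same state. The `Command` fields and the `apply` and `replay` functions are hypothetical and much simpler than a real engine.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Command:
    position: int
    timestamp_ms: int  # assigned once, when the command is appended to the log
    timer_due_ms: int


def apply(state, cmd):
    # Deterministic: the result depends only on (state, cmd).
    # Calling the wall clock here would make replay diverge.
    fired = cmd.timestamp_ms >= cmd.timer_due_ms
    return {**state, "last_position": cmd.position, "timer_fired": fired}


def replay(log):
    state = {}
    for cmd in log:
        state = apply(state, cmd)
    return state
```

A replay test in this style runs `replay` twice over the same log (or on two nodes) and asserts that the resulting states are equal.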

F4: Memory leak in engine

Aspect Detail
Detection Memory metric trending up over hours/days. OOM eventually.
Effects OOM crash. Service disruption during restart.
Severity 3
Likelihood 3
RPN 9
Mitigation Memory profiling in CI (e.g., pprof in Go). Bounded caches. Periodic process restart (cron, weekly) as defense.
Recovery Restart. Profile to find leak. Fix code.
RTO 30s (auto-restart)
RPO 0
Runbook If recurring: schedule weekly proactive restart until fix deployed.

Database failures

F5: Postgres primary failure

Aspect Detail
Detection Patroni health checks fail. Replicas notice.
Effects All writes blocked. Reads may work on replicas.
Severity 4
Likelihood 2
RPN 8
Mitigation Patroni automatic failover (ADR-020). Replicas continuously syncing.
Recovery Patroni promotes replica → ~30s downtime. PgBouncer reconnects to new primary.
RTO 30s
RPO ~5s (WAL streaming lag)
Runbook 1. Verify Patroni promoted replica. 2. Check failed primary cause (disk full, hardware?). 3. Re-introduce as replica after fix.

F6: Postgres replication lag high

Aspect Detail
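Detection Replication lag in bytes exceeds threshold. pg_stat_replication has no lag_bytes column; compute it per replica with pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn).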
Effects Read replicas serve stale data. Failover would lose data.
Severity 3
Likelihood 3
RPN 9
Mitigation Alert if lag > 100 MB. Tune network. Speed up WAL replay on the replica (replay is single-process in Postgres, so reduce replica load or use faster storage).
Recovery Investigate cause: network slow? Replica overloaded? Add more replicas?
RTO N/A (degraded)
RPO Variable based on lag
Runbook Alert: pg_replication_lag_bytes > 100000000. Investigate: replica CPU/IO bottleneck.
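
The byte lag behind this alert is the difference between two LSNs. A Postgres LSN is printed as two hexadecimal numbers separated by a slash (high and low 32 bits), which is what `pg_wal_lsn_diff` subtracts on the server side. The sketch below does the same arithmetic client-side; the function names are illustrative and the threshold is the 100 MB value from the mitigation above.

```python
LAG_ALERT_BYTES = 100_000_000  # 100 MB threshold from the runbook


def lsn_to_int(lsn: str) -> int:
    """Convert a Postgres LSN such as '16/B374D848' to a byte position."""
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def replication_lag_bytes(primary_lsn: str, replica_replay_lsn: str) -> int:
    return lsn_to_int(primary_lsn) - lsn_to_int(replica_replay_lsn)


def lag_alert(primary_lsn, replica_replay_lsn, threshold=LAG_ALERT_BYTES):
    return replication_lag_bytes(primary_lsn, replica_replay_lsn) > threshold
```

For example, a primary at `0/3000000` and a replica replayed up to `0/2000000` are 16 MiB apart, which is below the alert threshold.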

F7: Database disk full

Aspect Detail
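Detection Disk usage > 90%. Postgres does not switch to read-only on its own: once the disk is full, writes fail, and the server may PANIC and shut down if it cannot write WAL.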
Effects Writes fail. Engine cannot persist commands. Service down.
Severity 5
Likelihood 2
RPN 10
Mitigation Disk usage monitoring (alert at 80%). Auto-extend cloud volumes. Retention policy enforced (per concepts/command-log-compaction).
Recovery 1. Extend disk (if cloud, immediate). 2. Drop old partitions (concepts/command-log-compaction). 3. Archive to S3.
RTO Minutes to hours
RPO 0
Runbook URGENT: alert at 80%, P1 at 90%, P0 at 95%. Auto-extend if possible.
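
The alert tiers in this runbook (80%, 90%, 95%) map to a simple classification. The sketch below is one way to express it, assuming the thresholds are inclusive; `disk_alert_level` and `check_path` are illustrative names, and a real deployment would more likely encode these tiers as monitoring rules.

```python
import shutil


def disk_alert_level(used_fraction: float) -> str:
    """Map a used-space fraction (0.0-1.0) to the runbook's alert tier."""
    if used_fraction >= 0.95:
        return "P0"
    if used_fraction >= 0.90:
        return "P1"
    if used_fraction >= 0.80:
        return "ALERT"
    return "OK"


def check_path(path: str) -> str:
    # shutil.disk_usage returns (total, used, free) in bytes.
    usage = shutil.disk_usage(path)
    return disk_alert_level(usage.used / usage.total)
```

Calling `check_path` on the Postgres data directory would give the current tier for that volume.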

F8: Database corruption

Aspect Detail
Detection Postgres data-checksums errors. Queries return errors.
Effects Some data unreadable. May be silent until accessed.
Severity 5
Likelihood 1
RPN 5
Mitigation Enable data checksums at initdb (initdb --data-checksums). Hardware ECC RAM. RAID for disks. Regular, verified backups.
Recovery Restore from last known good backup (PITR with pgBackRest). Lose data after backup point.
RTO Hours
RPO Up to backup interval (typically 5-15 min with WAL streaming)
Runbook P0. Page DBA. Halt writes (read-only mode). Restore from backup. Notify affected tenants.

F9: Connection pool exhausted

Aspect Detail
Detection New connections rejected. App errors.
Effects New requests fail. Existing connections continue.
Severity 3
Likelihood 3
RPN 9
Mitigation PgBouncer with proper sizing (default_pool_size, max_client_conn). Application connection pool limits.
Recovery Identify what is holding connections. Kill long queries. Restart engine if needed.
RTO Minutes
RPO 0
Runbook 1. SELECT state, count(*) FROM pg_stat_activity GROUP BY state. 2. Kill long-running queries. 3. Investigate why connections leak.

F10: Postgres query degradation (slow queries)

Aspect Detail
Detection pg_stat_statements shows high mean_exec_time (mean_time before PostgreSQL 13) for common queries. Tail latency increases.
Effects High latency. Queue depth grows. Potential timeout cascade.
Severity 3
Likelihood 4
RPN 12
Mitigation Indexes from day 1 (per schema designs). pg_stat_statements monitoring. Regular ANALYZE. Vacuum tuning.
Recovery Identify slow query. Add missing index. Rewrite query. Increase resources.
RTO Hours
RPO 0
Runbook Run the concepts/postgres-monitoring queries weekly. Review and optimize the top slow queries.

Network failures

F11: Network partition between engine and DB

Aspect Detail
Detection Connection failures spike.
Effects Engine cannot process. Health endpoint may fail.
Severity 4
Likelihood 2
RPN 8
Mitigation Retry with backoff in DB client. Circuit breaker. Multiple AZs.
Recovery Network restoration. Engine resumes processing.
RTO Minutes
RPO 0
Runbook Verify infrastructure (cloud provider status, network policies).
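
The two client-side mitigations in F11, retry with backoff and a circuit breaker, can be sketched as follows. This is a simplified illustration with hypothetical names and default values; it treats `ConnectionError` as the only transient failure and uses exponential backoff with full jitter.

```python
import random
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and calls fail fast."""


class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_after_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self._clock = clock
        self._failures = 0
        self._opened_at = None

    def call(self, fn):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after_s:
                raise CircuitOpenError("circuit open; failing fast")
            # Cool-down elapsed: let one trial call through (half-open).
        try:
            result = fn()
        except ConnectionError:
            self._failures += 1
            if self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # open (or re-open) the circuit
            raise
        # Success closes the circuit and resets the failure count.
        self._failures = 0
        self._opened_at = None
        return result


def retry_with_backoff(fn, attempts=5, base_delay_s=0.1, max_delay_s=5.0, sleep=time.sleep):
    for attempt in range(attempts):
        try:
            return fn()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the error
            delay = min(max_delay_s, base_delay_s * 2 ** attempt)
            sleep(random.uniform(0, delay))  # full jitter
```

Retries absorb short blips, while the breaker stops the engine from piling requests onto a database it cannot reach during a longer partition.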

F12: Network partition between engine instances (Phase 2+)

Aspect Detail
Detection Leader election thrashing. Multiple "leaders" in logs.
Effects Potential split-brain. State inconsistency risk.
Severity 5
Likelihood 1
RPN 5
Mitigation Leader election through Postgres advisory locks, with the DB primary as the single source of truth. Use an odd cluster size to maintain quorum.
Recovery Network healing. One leader emerges. Investigate split-brain effects in audit log.
RTO Minutes
RPO 0 (Postgres serializes)
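Runbook Verify only one engine is leader. There is no pg_advisory_lock_status function; inspect held advisory locks with SELECT pid, objid FROM pg_locks WHERE locktype = 'advisory' AND granted.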

Worker failures

F13: Worker crashes mid-job

Aspect Detail
Detection Job timeout expires without completion.
Effects Job re-distributed to another worker. Possible duplicate work (per ADR-007).
Severity 2
Likelihood 4
RPN 8
Mitigation Worker idempotency required (ADR-007). Activation versioning. Reasonable timeouts.
Recovery Engine auto-reactivates job. Other worker picks up.
RTO Equal to job timeout (~30s)
RPO 0
Runbook Investigate worker crash cause if frequent.

F14: Worker hangs indefinitely

Aspect Detail
Detection Job timeout expires. Worker still "running" but unresponsive.
Effects Same as F13 + worker resource leak.
Severity 3
Likelihood 3
RPN 9
Mitigation Worker SDK enforces timeouts (kill async tasks after N seconds). Health checks for worker process.
Recovery Restart worker. Job continues elsewhere.
RTO Job timeout
RPO 0
Runbook Investigate worker hang pattern. May indicate external API issue.
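
The SDK-side timeout from F14 could look like the sketch below for an asyncio-based worker. `run_job_with_timeout` is a hypothetical helper, not part of any existing SDK. `asyncio.wait_for` cancels the handler when the timeout expires.

```python
import asyncio


async def run_job_with_timeout(handler, job, timeout_s):
    """Run handler(job); cancel it if it exceeds timeout_s."""
    try:
        result = await asyncio.wait_for(handler(job), timeout=timeout_s)
        return ("completed", result)
    except asyncio.TimeoutError:
        # wait_for has already cancelled the handler task at this point.
        return ("timed_out", None)
```

Cancellation only works for handlers that yield to the event loop. A handler stuck in a blocking call cannot be interrupted this way, which is why the process-level health check is still needed.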

F15: Worker not idempotent → duplicate side effects

Aspect Detail
Detection User reports double charges, duplicate emails. Audit log shows duplicate completions.
Effects Real-world impact (financial, customer experience).
Severity 4
Likelihood 3
RPN 12
Mitigation Worker idempotency mandatory (ADR-007). Code review checks. Documentation.
Recovery Identify affected operations. Manual remediation (refund, apology). Fix worker code.
RTO Hours to days
RPO N/A
Runbook Audit duplicate processing patterns in worker logs. Implement idempotency immediately.
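
One common way to make a worker idempotent is to key each side effect on a stable identifier and skip it if that key has already been processed. The sketch below shows the idea with an in-memory dictionary. `IdempotentExecutor` and the key format are assumptions for illustration; a production worker needs a durable store with an atomic check-and-record (for example, a unique constraint), or an external API that accepts an idempotency key.

```python
class IdempotentExecutor:
    """Runs a side effect at most once per idempotency key."""

    def __init__(self):
        # key -> result; in-memory only for the sketch.
        self._results = {}

    def execute(self, key, side_effect):
        if key in self._results:
            # Duplicate activation: return the recorded result, do nothing.
            return self._results[key]
        result = side_effect()
        self._results[key] = result
        return result
```

If the engine re-activates a job after a timeout, the second attempt with the same key returns the stored result instead of charging or emailing again.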

BPMN edge cases

F16: Malformed BPMN deployment

Aspect Detail
Detection Deployment endpoint returns 400 with details.
Effects User cannot deploy. No production impact.
Severity 1
Likelihood 3
RPN 3
Mitigation Validation (XSD + semantic + MVP-specific). Helpful error messages.
Recovery User fixes BPMN and retries.
RTO N/A
RPO N/A
Runbook N/A (user issue).

F17: BPMN with unsupported elements

Aspect Detail
Detection Deployment rejected with element list.
Effects User cannot deploy that BPMN.
Severity 2
Likelihood 4
RPN 8
Mitigation Documented supported subset. Conversion tool from Camunda BPMN.
Recovery User modifies BPMN to use supported elements.
RTO N/A
RPO N/A
Runbook Direct user to supported elements docs.

F18: FEEL/CEL expression evaluation fails

Aspect Detail
Detection Incident created (EXTRACT_VALUE_ERROR).
Effects Process instance stuck pending incident resolution.
Severity 2
Likelihood 4
RPN 8
Mitigation Expression validation at deploy time. Helpful error messages including variable context.
Recovery Admin resolves incident: fix variable values, retry. Or cancel instance.
RTO Minutes (manual)
RPO N/A
Runbook The Operate-equivalent UI shows incident details. Admin reviews and resolves.

F19: Infinite loop in BPMN

Aspect Detail
Detection Per-instance event count > threshold. Memory/storage growth.
Effects One instance consumes huge resources. Other instances OK.
Severity 3
Likelihood 2
RPN 6
Mitigation Detection: > 10K events per instance = warning, > 100K = auto-suspend.
Recovery Operator cancels infinite instance. Fix BPMN.
RTO Manual (hours)
RPO N/A
Runbook Alert: process_instance_events > 100000. Investigate which instance. Cancel.

Clock issues

F20: Clock skew between engines

Aspect Detail
Detection Timer fire times inconsistent. Audit logs show non-monotonic timestamps.
Effects Timer fires at wrong time. Replay determinism issues (engine fires before instructed).
Severity 4
Likelihood 2
RPN 8
Mitigation NTP/chrony on all hosts. Use DB time (NOW()) as authoritative when possible. Clock alert if > 1s drift.
Recovery Force NTP sync. Investigate hardware issues.
RTO Minutes
RPO 0
Runbook Monitor node_timex_offset_seconds. Alert > 1s.

F21: Clock jumps backward

Aspect Detail
Detection Timestamps in DB go backwards. Engine may panic.
Effects Replay non-deterministic. Multiple weird bugs.
Severity 5
Likelihood 1
RPN 5
Mitigation Monotonic clock used for durations (not wall clock). NTP slew mode (gradual adjustment).
Recovery Halt engine. Fix clock source. Investigate damage.
RTO Hours
RPO 0
Runbook P0. This is dangerous. Halt and investigate.
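
The mitigation above relies on measuring durations with a monotonic clock, which never goes backwards even if NTP steps the wall clock. A minimal sketch in Python, where `timed` is an illustrative helper:

```python
import time


def timed(fn):
    """Run fn and return (result, elapsed_seconds) using a monotonic clock."""
    start = time.monotonic()
    result = fn()
    # time.monotonic() cannot go backwards, so elapsed is never negative.
    elapsed = time.monotonic() - start
    return result, elapsed
```

The same subtraction done with `time.time()` can produce a negative or wildly wrong duration if the wall clock is adjusted between the two reads.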

Cascading failures

F22: Slow worker → timer fire storm → engine overload

Chain:

  1. Worker pool slow processing (overloaded external API)
  2. Jobs timeout → re-activated
  3. Timers fire on retry logic → more commands
  4. Engine queue grows → backpressure trips
  5. New legitimate work rejected

Mitigation: separate worker pools per job type. Per-job-type rate limiting. Circuit breaker for external services in workers.

F23: DB slow → engine queue grows → OOM

Chain:

  1. DB query degrades (missing index)
  2. Engine processing slows down
  3. Commands queue grows in memory
  4. Memory pressure → swap → slower
  5. OOM → restart → backlog → repeat

Mitigation: bounded queue (per concepts/backpressure-rest-strategy). Aggressive DB query monitoring. Alerts on queue depth.
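
A bounded queue breaks this chain at step 3: once the queue is full, new commands are rejected instead of accumulating in memory. The sketch below uses the standard library `queue.Queue`. `BoundedCommandQueue` and its method names are illustrative, and the strategy actually chosen is described in concepts/backpressure-rest-strategy.

```python
import queue


class BoundedCommandQueue:
    def __init__(self, max_depth):
        self._q = queue.Queue(maxsize=max_depth)

    def offer(self, command):
        """Return True if accepted, False if the queue is full."""
        try:
            self._q.put_nowait(command)
            return True
        except queue.Full:
            # Caller should reject the request rather than buffer it.
            return False

    def take(self):
        return self._q.get_nowait()

    def depth(self):
        return self._q.qsize()
```

When `offer` returns False, the API layer can answer with a retryable error, and `depth` can feed the queue-depth alert.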

F24: Cache invalidation storm

Chain:

  1. Multiple deploys in short time
  2. Each invalidates process_definition cache
  3. Engine re-parses repeatedly
  4. CPU saturated
  5. All processing slows

Mitigation: rate limit deploys (1 per second per tenant). Cache also has TTL (not just invalidation-based).
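
The per-tenant deploy limit can be sketched as a minimum interval between accepted deploys. `PerTenantRateLimiter` is a hypothetical name, the one-second interval comes from the mitigation above, and the clock is injectable only for testing.

```python
import time


class PerTenantRateLimiter:
    """Allows at most one action per tenant every min_interval_s seconds."""

    def __init__(self, min_interval_s=1.0, clock=time.monotonic):
        self._min_interval_s = min_interval_s
        self._clock = clock
        self._last_allowed = {}  # tenant -> time of last accepted action

    def allow(self, tenant):
        now = self._clock()
        last = self._last_allowed.get(tenant)
        if last is not None and now - last < self._min_interval_s:
            return False  # too soon: reject this deploy
        self._last_allowed[tenant] = now
        return True
```

Each tenant is tracked separately, so a burst of deploys from one tenant does not block another.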

Recovery procedures

Common procedures

Drain mode (graceful shutdown)

# Stop accepting new commands
mvp-cli engine drain --instance=engine-1

# Wait for queue to drain
mvp-cli engine wait-for-quiet --instance=engine-1 --timeout=5m

# Now safe to restart/maintenance
systemctl stop engine

Catastrophic recovery (PITR)

# 1. Halt all engines
mvp-cli cluster halt

# 2. Restore Postgres to PITR
pgbackrest --stanza=main --type=time --target="2025-05-14 10:00:00" restore

# 3. Replay command_log from logs after restore point (if any survived)
mvp-cli engine replay --from-position=$LAST_COMMITTED

# 4. Resume normal operation
mvp-cli cluster start

Tenant-specific issue

# Suspend tenant (no new operations accepted)
mvp-cli tenant suspend acme --reason="investigation"

# Investigate via audit log
mvp-cli audit query --tenant=acme --since="2025-05-14 09:00" --action="*"

# Resolve issue
# ...

# Resume
mvp-cli tenant resume acme

Priority matrix

Ordering by RPN (highest priority first):

F# Failure RPN Phase
F1 Engine crash 12 Mitigate Phase 0
F10 Slow queries 12 Mitigate Phase 0
F15 Worker non-idempotent 12 Education ongoing
F7 Disk full 10 Mitigate Phase 0
F4 Memory leak 9 Mitigate Phase 1
F6 Replication lag 9 Mitigate Phase 1
F9 Connection exhausted 9 Mitigate Phase 0
F14 Worker hangs 9 Mitigate Phase 0
F2 Engine stuck 8 Mitigate Phase 0
F5 DB primary fails 8 Phase 1 (Patroni)
F11 Network partition DB 8 Phase 1
F13 Worker crash 8 Mitigate Phase 0
F17 Unsupported BPMN 8 Education
F18 Expression fails 8 Mitigate Phase 0
F20 Clock skew 8 Mitigate Phase 0
F19 Infinite loop 6 Phase 1
F8 DB corruption 5 Mitigate Phase 0
F3 Replay determinism 5 Test Phase 0
F12 Engine split-brain 5 Phase 2 design
F21 Clock jumps back 5 Mitigate Phase 0
F16 Malformed BPMN 3 Documentation

Resulting actions

Phase 0 mitigations (must have):

  • Process supervisor (systemd/Docker)
  • Memory limits
  • Disk usage monitoring
  • Bounded queue
  • Connection pool sizing
  • Indexes from day 1
  • Idempotency required (ADR-007)
  • Replay determinism tests (ADR-019)
  • NTP/chrony
  • Data checksums
  • Watchdog timer in processing loop
  • Per-instance event count limit
  • Comprehensive logging
  • Health endpoints

Phase 1 additions:

  • Patroni HA (ADR-020)
  • Read replicas
  • PgBouncer
  • Replication lag alerting
  • PITR with pgBackRest