Command Log Compaction
Command + event log crece infinitamente sin compaction. 1M instances × 50 events × 1KB = 50GB solo events. Estrategia MVP: archivado por partitioning de Postgres (pg_partman) + retention policy configurable + cold storage en S3 para compliance/audit. Replay determinism preserved si retention > tiempo de longest-running instance. Sin truncation activa requerida hasta Phase 3. Phase 3+ requiere active strategy.
El problema¶
Per ADR-005, MVP usa command sourcing — TODO command y event persisten en log:
CREATE TABLE command_log (
position BIGSERIAL PRIMARY KEY,
timestamp TIMESTAMPTZ NOT NULL,
tenant_id TEXT NOT NULL,
intent TEXT NOT NULL,
payload JSONB NOT NULL
);
CREATE TABLE event_log (
position BIGSERIAL PRIMARY KEY,
command_position BIGINT,
timestamp TIMESTAMPTZ NOT NULL,
tenant_id TEXT NOT NULL,
intent TEXT NOT NULL,
payload JSONB NOT NULL
);
Storage growth realista:
Per process instance (avg):
- ~50 events (CREATE, ACTIVATE × N elements, COMPLETE × N, etc.)
- Each event ~1 KB (JSONB compresses, but realistic)
Total: ~50 KB
1M instances/month: 50 GB
10M instances/month: 500 GB
100M instances/month: 5 TB
After 1 year:
- 1M/month: 600 GB
- 10M/month: 6 TB (Postgres single-node limit)
- 100M/month: 60 TB (Phase 4+ needed)
Sin strategy, hit storage limits eventually.
Camunda's approach: snapshots + log compaction¶
Camunda: 1. Periodically takes snapshot of RocksDB state 2. After snapshot, truncates log up to snapshot position 3. Log + snapshot = full state recoverable
MVP no tiene snapshots (Postgres replication is the recovery mechanism). Diferente strategy needed.
MVP strategy: time-based partitioning + archival¶
Concept: log es time-series data¶
Insight: command_log y event_log son time-series data. Patterns: - Writes son siempre at "now" (append-only) - Reads de history son rare después de instance completion - Most queries son recent (last hour / day)
Postgres handles time-series via table partitioning.
Schema con partitioning¶
-- Use pg_partman extension
CREATE EXTENSION pg_partman;
-- Partition command_log by month
CREATE TABLE command_log (
position BIGSERIAL,
timestamp TIMESTAMPTZ NOT NULL,
tenant_id TEXT NOT NULL,
intent TEXT NOT NULL,
payload JSONB NOT NULL,
PRIMARY KEY (position, timestamp)
) PARTITION BY RANGE (timestamp);
-- Setup pg_partman to auto-create monthly partitions
SELECT partman.create_parent(
p_parent_table => 'public.command_log',
p_control => 'timestamp',
p_type => 'native',
p_interval => '1 month',
p_premake => 3 -- create next 3 months ahead
);
-- Run periodically (cron)
SELECT partman.run_maintenance();
Same for event_log.
Partitions resultantes¶
command_log
├── command_log_y2025m01 (Jan 2025)
├── command_log_y2025m02 (Feb 2025)
├── command_log_y2025m03 (Mar 2025)
├── command_log_y2025m04 (Apr 2025 - current)
└── command_log_y2025m05 (May 2025 - future)
Beneficios: - Inserts go to current partition (fast) - Queries with timestamp filter only scan relevant partitions (pruning) - Old partitions can be archived/dropped as single DDL operation (fast)
Retention policy¶
Three tiers¶
flowchart TD
T1[Tier 1: Hot - Postgres main, last 90 days<br/>Query latency: ms<br/>Cost: high SSD<br/>Active monitoring, recent debug]
T2[Tier 2: Warm - Postgres archive table<br/>Query latency: seconds<br/>Cost: medium<br/>Forensics, compliance audit]
T3[Tier 3: Cold - S3, Glacier<br/>Query latency: minutes-hours<br/>Cost: very low<br/>Long-term compliance only]
T1 -->|after 90 days| T2
T2 -->|after 1 year| T3
Implementation¶
# Daily cron job
async def daily_retention_job():
# Archive 90+ day old partitions to cold storage
cutoff_date = datetime.utcnow() - timedelta(days=90)
old_partitions = await get_partitions_older_than('command_log', cutoff_date)
for partition in old_partitions:
# Export to S3
await pg_dump_partition_to_s3(
partition,
bucket='workflow-archives',
key=f"{partition.name}.sql.gz"
)
# Drop partition from Postgres
await db.execute(f"ALTER TABLE command_log DETACH PARTITION {partition.name}")
await db.execute(f"DROP TABLE {partition.name}")
# Similar for event_log
Configurable retention¶
CREATE TABLE retention_policies (
tenant_id TEXT PRIMARY KEY,
hot_retention_days INT DEFAULT 90,
warm_retention_days INT DEFAULT 365,
archive_to_cold BOOLEAN DEFAULT TRUE,
archive_bucket TEXT
);
Compliance-heavy tenants: retention years. Free tier: 30 days.
Replay determinism constraint¶
Critical: per ADR-019, replay determinism requires log integrity. Pero solo desde el inicio del longest-running instance.
flowchart TD
S0["State at T=0: empty"] --> C["Commands C1..Cn: t=0...t=now"]
C --> SN["Snapshot of state at T=now<br/>derivable from C1..Cn"]
SN --> X["If oldest active instance started at t=X,<br/>need commands from t=X onwards to replay.<br/>Commands before t=X involve only completed instances<br/>(safe to truncate)"]
Find safe truncation point¶
-- Oldest active instance
SELECT MIN(start_date) AS oldest_active
FROM process_instances
WHERE state IN ('ACTIVE', 'INCIDENT');
-- Result: 2025-03-15 (e.g.)
-- Safe to archive partitions < 2025-03-15 que contengan SOLO completed instances
Conservative approach¶
Don't truncate active instances. Per-instance threshold:
async def can_archive_partition(partition):
"""Check if partition can be archived safely."""
cutoff = partition.timestamp_range_end
# Are there active instances that started BEFORE this partition's end?
active_count = await db.fetch_val("""
SELECT COUNT(*) FROM process_instances
WHERE state IN ('ACTIVE', 'INCIDENT')
AND start_date < $1
""", cutoff)
if active_count > 0:
return False, f"{active_count} active instances pre-date this partition"
return True, "safe"
Solo archive partitions cuando NO hay active instances pre-dating them.
Worst case: long-running process¶
Si tienes process que dura 6 meses, retention efectiva es 6+ months minimum. Document esta restriction.
Archival format¶
S3 archives son SQL dumps compressed:
pg_dump --table=command_log_y2024m12 \
--format=custom \
--compress=9 \
workflow_db > command_log_y2024m12.dump
aws s3 cp command_log_y2024m12.dump \
s3://workflow-archives/2024/12/command_log.dump
Compression ratio for JSONB ~3-5x. 60 GB partition → ~12-20 GB archive.
S3 Glacier para >1 year old: $0.004/GB/month. 1 TB = $4/month.
Restore from cold storage¶
For compliance audits or forensics:
mvp-cli archive restore --partition=2024-12 \
--to=command_log_y2024m12_restored
# Engine doesn't see it directly, but admin can query
SELECT * FROM command_log_y2024m12_restored
WHERE tenant_id = 'acme' AND payload->>'processInstanceKey' = '12345';
Restored tables NOT attached to main partition tree — separate read-only tables.
Per-tenant tracking¶
CREATE TABLE retention_tracking (
tenant_id TEXT,
partition_name TEXT,
archived_at TIMESTAMPTZ,
archive_location TEXT,
archive_size_bytes BIGINT,
PRIMARY KEY (tenant_id, partition_name)
);
Audit log de qué se archivó cuándo.
Compliance considerations¶
GDPR right to deletion¶
Cliente requests deletion of all data for an EU user:
-- Find process instances for this user
DELETE FROM process_instances WHERE variables->>'user_id' = $user_id;
-- Also need to delete from log
DELETE FROM command_log WHERE payload->>'user_id' = $user_id;
DELETE FROM event_log WHERE payload->>'user_id' = $user_id;
Problem: if archived to S3, must also delete from S3.
Implementation:
- Maintain user_id index on archived chunks (metadata in main DB)
- Cron job for delete-from-archive periodically
Alternative: don't archive PII — strip from variables before archival. Maintain reference to user_id, not full PII.
Audit retention requirements¶
Industries: - SOX: 7 years - HIPAA: 6 years - Banking: 7-10 years
Configure retention para meet requirements:
INSERT INTO retention_policies (tenant_id, ...)
VALUES ('healthcare-client', 90, 2190, true); -- 6 years warm
Cold storage hace este cost-feasible.
Cleanup state tables also¶
Beyond logs, state tables also need cleanup:
-- Old completed process_instances also waste space
-- These are CRUD state derived from log
-- After log archival, can also archive state
DELETE FROM process_instances
WHERE state IN ('COMPLETED', 'CANCELED')
AND end_date < NOW() - INTERVAL '90 days'
AND tenant_id IN (SELECT tenant_id FROM retention_policies WHERE hot_retention_days <= 90);
State recoverable from log if needed (replay).
Monitoring¶
-- Storage por tenant
SELECT
tenant_id,
pg_size_pretty(SUM(pg_total_relation_size(...))) AS total_size
FROM ...
GROUP BY tenant_id
ORDER BY SUM(...) DESC;
-- Partition sizes
SELECT
schemaname, tablename,
pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename))
FROM pg_tables
WHERE tablename LIKE 'command_log_%'
ORDER BY tablename;
Alert si tenant > expected size (anomaly):
Phase 4+: Citus and archival¶
Citus shards distribute data. Archival per shard:
Citus + pg_partman work together natively.
Performance impact¶
Partition pruning is huge win:
-- Query con timestamp filter
SELECT * FROM event_log
WHERE timestamp > NOW() - INTERVAL '1 day'
AND tenant_id = 'acme';
-- Postgres only scans last partition. Old partitions ignored.
For queries en old data (rare), full scan. Acceptable tradeoff.
Decision summary¶
| Aspect | Decision | Trigger |
|---|---|---|
| Partitioning | pg_partman monthly | Day 1 |
| Hot retention | 90 days default | Day 1 |
| Warm retention | 365 days default | Day 1 |
| Cold storage | S3 + Glacier | Day 1 |
| Active truncation | NO until Phase 3+ | When > 1 TB |
| Per-tenant config | Yes | Day 1 |
Links¶
- adrs/adr-005-stream-processing-command-sourcing — Foundation
- adrs/adr-019-replay-determinism-invariant — Constraint on truncation
- concepts/postgres-monitoring — Storage monitoring queries
- pg_partman
- WAL-G archival