Saltar a contenido

Command Log Compaction

Command + event log crece infinitamente sin compaction. 1M instances × 50 events × 1KB = 50GB solo events. Estrategia MVP: archivado por partitioning de Postgres (pg_partman) + retention policy configurable + cold storage en S3 para compliance/audit. Replay determinism preserved si retention > tiempo de longest-running instance. Sin truncation activa requerida hasta Phase 3. Phase 3+ requiere active strategy.

El problema

Per ADR-005, MVP usa command sourcing — TODO command y event persisten en log:

CREATE TABLE command_log (
    position BIGSERIAL PRIMARY KEY,
    timestamp TIMESTAMPTZ NOT NULL,
    tenant_id TEXT NOT NULL,
    intent TEXT NOT NULL,
    payload JSONB NOT NULL
);

CREATE TABLE event_log (
    position BIGSERIAL PRIMARY KEY,
    command_position BIGINT,
    timestamp TIMESTAMPTZ NOT NULL,
    tenant_id TEXT NOT NULL,
    intent TEXT NOT NULL,
    payload JSONB NOT NULL
);

Storage growth realista:

Per process instance (avg):
  - ~50 events (CREATE, ACTIVATE × N elements, COMPLETE × N, etc.)
  - Each event ~1 KB (JSONB compresses, but realistic)
  Total: ~50 KB

1M instances/month: 50 GB
10M instances/month: 500 GB
100M instances/month: 5 TB

After 1 year:
  - 1M/month: 600 GB
  - 10M/month: 6 TB (Postgres single-node limit)
  - 100M/month: 60 TB (Phase 4+ needed)

Sin strategy, hit storage limits eventually.

Camunda's approach: snapshots + log compaction

Camunda: 1. Periodically takes snapshot of RocksDB state 2. After snapshot, truncates log up to snapshot position 3. Log + snapshot = full state recoverable

MVP no tiene snapshots (Postgres replication is the recovery mechanism). Diferente strategy needed.

MVP strategy: time-based partitioning + archival

Concept: log es time-series data

Insight: command_log y event_log son time-series data. Patterns: - Writes son siempre at "now" (append-only) - Reads de history son rare después de instance completion - Most queries son recent (last hour / day)

Postgres handles time-series via table partitioning.

Schema con partitioning

-- Use pg_partman extension
CREATE EXTENSION pg_partman;

-- Partition command_log by month
CREATE TABLE command_log (
    position BIGSERIAL,
    timestamp TIMESTAMPTZ NOT NULL,
    tenant_id TEXT NOT NULL,
    intent TEXT NOT NULL,
    payload JSONB NOT NULL,
    PRIMARY KEY (position, timestamp)
) PARTITION BY RANGE (timestamp);

-- Setup pg_partman to auto-create monthly partitions
SELECT partman.create_parent(
    p_parent_table => 'public.command_log',
    p_control => 'timestamp',
    p_type => 'native',
    p_interval => '1 month',
    p_premake => 3  -- create next 3 months ahead
);

-- Run periodically (cron)
SELECT partman.run_maintenance();

Same for event_log.

Partitions resultantes

command_log
├── command_log_y2025m01    (Jan 2025)
├── command_log_y2025m02    (Feb 2025)
├── command_log_y2025m03    (Mar 2025)
├── command_log_y2025m04    (Apr 2025 - current)
└── command_log_y2025m05    (May 2025 - future)

Beneficios: - Inserts go to current partition (fast) - Queries with timestamp filter only scan relevant partitions (pruning) - Old partitions can be archived/dropped as single DDL operation (fast)

Retention policy

Three tiers

flowchart TD
    T1[Tier 1: Hot - Postgres main, last 90 days<br/>Query latency: ms<br/>Cost: high SSD<br/>Active monitoring, recent debug]
    T2[Tier 2: Warm - Postgres archive table<br/>Query latency: seconds<br/>Cost: medium<br/>Forensics, compliance audit]
    T3[Tier 3: Cold - S3, Glacier<br/>Query latency: minutes-hours<br/>Cost: very low<br/>Long-term compliance only]
    T1 -->|after 90 days| T2
    T2 -->|after 1 year| T3

Implementation

# Daily cron job
async def daily_retention_job():
    # Archive 90+ day old partitions to cold storage
    cutoff_date = datetime.utcnow() - timedelta(days=90)

    old_partitions = await get_partitions_older_than('command_log', cutoff_date)

    for partition in old_partitions:
        # Export to S3
        await pg_dump_partition_to_s3(
            partition, 
            bucket='workflow-archives',
            key=f"{partition.name}.sql.gz"
        )

        # Drop partition from Postgres
        await db.execute(f"ALTER TABLE command_log DETACH PARTITION {partition.name}")
        await db.execute(f"DROP TABLE {partition.name}")

# Similar for event_log

Configurable retention

CREATE TABLE retention_policies (
    tenant_id TEXT PRIMARY KEY,
    hot_retention_days INT DEFAULT 90,
    warm_retention_days INT DEFAULT 365,
    archive_to_cold BOOLEAN DEFAULT TRUE,
    archive_bucket TEXT
);

Compliance-heavy tenants: retention years. Free tier: 30 days.

Replay determinism constraint

Critical: per ADR-019, replay determinism requires log integrity. Pero solo desde el inicio del longest-running instance.

flowchart TD
    S0["State at T=0: empty"] --> C["Commands C1..Cn: t=0...t=now"]
    C --> SN["Snapshot of state at T=now<br/>derivable from C1..Cn"]
    SN --> X["If oldest active instance started at t=X,<br/>need commands from t=X onwards to replay.<br/>Commands before t=X involve only completed instances<br/>(safe to truncate)"]

Find safe truncation point

-- Oldest active instance
SELECT MIN(start_date) AS oldest_active
FROM process_instances
WHERE state IN ('ACTIVE', 'INCIDENT');

-- Result: 2025-03-15 (e.g.)
-- Safe to archive partitions < 2025-03-15 que contengan SOLO completed instances

Conservative approach

Don't truncate active instances. Per-instance threshold:

async def can_archive_partition(partition):
    """Check if partition can be archived safely."""
    cutoff = partition.timestamp_range_end

    # Are there active instances that started BEFORE this partition's end?
    active_count = await db.fetch_val("""
        SELECT COUNT(*) FROM process_instances
        WHERE state IN ('ACTIVE', 'INCIDENT')
          AND start_date < $1
    """, cutoff)

    if active_count > 0:
        return False, f"{active_count} active instances pre-date this partition"

    return True, "safe"

Solo archive partitions cuando NO hay active instances pre-dating them.

Worst case: long-running process

Si tienes process que dura 6 meses, retention efectiva es 6+ months minimum. Document esta restriction.

Archival format

S3 archives son SQL dumps compressed:

pg_dump --table=command_log_y2024m12 \
        --format=custom \
        --compress=9 \
        workflow_db > command_log_y2024m12.dump

aws s3 cp command_log_y2024m12.dump \
        s3://workflow-archives/2024/12/command_log.dump

Compression ratio for JSONB ~3-5x. 60 GB partition → ~12-20 GB archive.

S3 Glacier para >1 year old: $0.004/GB/month. 1 TB = $4/month.

Restore from cold storage

For compliance audits or forensics:

mvp-cli archive restore --partition=2024-12 \
                       --to=command_log_y2024m12_restored

# Engine doesn't see it directly, but admin can query
SELECT * FROM command_log_y2024m12_restored 
WHERE tenant_id = 'acme' AND payload->>'processInstanceKey' = '12345';

Restored tables NOT attached to main partition tree — separate read-only tables.

Per-tenant tracking

CREATE TABLE retention_tracking (
    tenant_id TEXT,
    partition_name TEXT,
    archived_at TIMESTAMPTZ,
    archive_location TEXT,
    archive_size_bytes BIGINT,
    PRIMARY KEY (tenant_id, partition_name)
);

Audit log de qué se archivó cuándo.

Compliance considerations

GDPR right to deletion

Cliente requests deletion of all data for an EU user:

-- Find process instances for this user
DELETE FROM process_instances WHERE variables->>'user_id' = $user_id;

-- Also need to delete from log
DELETE FROM command_log WHERE payload->>'user_id' = $user_id;
DELETE FROM event_log WHERE payload->>'user_id' = $user_id;

Problem: if archived to S3, must also delete from S3.

Implementation: - Maintain user_id index on archived chunks (metadata in main DB) - Cron job for delete-from-archive periodically

Alternative: don't archive PII — strip from variables before archival. Maintain reference to user_id, not full PII.

Audit retention requirements

Industries: - SOX: 7 years - HIPAA: 6 years - Banking: 7-10 years

Configure retention para meet requirements:

INSERT INTO retention_policies (tenant_id, ...)
VALUES ('healthcare-client', 90, 2190, true);  -- 6 years warm

Cold storage hace este cost-feasible.

Cleanup state tables also

Beyond logs, state tables also need cleanup:

-- Old completed process_instances also waste space
-- These are CRUD state derived from log

-- After log archival, can also archive state
DELETE FROM process_instances
WHERE state IN ('COMPLETED', 'CANCELED')
  AND end_date < NOW() - INTERVAL '90 days'
  AND tenant_id IN (SELECT tenant_id FROM retention_policies WHERE hot_retention_days <= 90);

State recoverable from log if needed (replay).

Monitoring

-- Storage por tenant
SELECT 
    tenant_id,
    pg_size_pretty(SUM(pg_total_relation_size(...))) AS total_size
FROM ...
GROUP BY tenant_id
ORDER BY SUM(...) DESC;

-- Partition sizes
SELECT 
    schemaname, tablename,
    pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename))
FROM pg_tables
WHERE tablename LIKE 'command_log_%'
ORDER BY tablename;

Alert si tenant > expected size (anomaly):

- alert: TenantStorageHigh
  expr: storage_bytes{tenant=~".+"} > 100e9   # 100GB
  for: 1h

Phase 4+: Citus and archival

Citus shards distribute data. Archival per shard:

-- Each shard handles its own partition archival
-- Coordinator orchestrates schedule

Citus + pg_partman work together natively.

Performance impact

Partition pruning is huge win:

-- Query con timestamp filter
SELECT * FROM event_log 
WHERE timestamp > NOW() - INTERVAL '1 day' 
  AND tenant_id = 'acme';

-- Postgres only scans last partition. Old partitions ignored.

For queries en old data (rare), full scan. Acceptable tradeoff.

Decision summary

Aspect Decision Trigger
Partitioning pg_partman monthly Day 1
Hot retention 90 days default Day 1
Warm retention 365 days default Day 1
Cold storage S3 + Glacier Day 1
Active truncation NO until Phase 3+ When > 1 TB
Per-tenant config Yes Day 1