Postgres Monitoring
La doc oficial de Camunda 8 (
docs/monorepo-docs/architecture/components/rdbms/benchmarking.md) provee queries SQL para diagnosticar performance del backend RDBMS: pg_stat_activity (sesiones), pg_stat_statements (top queries por costo), cache hit ratio, dead tuples, índices no usados, WAL buffer. Reglas operacionales: cache hit > 90%, dead tuples < 5%, wal_buffers_full = 0.
Setup: pg_stat_statements¶
Requerido para análisis de queries. Conectarse como postgres admin:
Para resetear contadores acumulados:
Análisis de sesiones activas¶
SELECT * FROM pg_stat_activity
WHERE state = 'active'
AND usename = 'camunda'
AND client_addr IS NOT NULL;
Campos clave:
- wait_event — si no es null, la sesión está bloqueada
- wait_event_type — Lock, IO, Client, BufferPin, Extension
- query_start — sesiones de larga duración indican queries lentas
Top 50 queries más caras¶
Esta es la query más importante para detectar bottlenecks:
SELECT query,
round((100 * total_exec_time / sum(total_exec_time) OVER ())::numeric, 3) AS percentage_of_total_time,
round((total_exec_time / NULLIF(rows, 0) * 1000)::numeric, 2) AS exec_time_per_1000_rows_ms,
round(rows / NULLIF(calls, 0)) AS avg_rows_per_call,
round((total_exec_time / NULLIF(calls, 0))::numeric, 2) AS avg_time_per_call_ms,
round(total_exec_time::numeric, 2) AS total_exec_time_ms,
round(mean_exec_time::numeric, 2) AS mean_time_ms,
round(min_exec_time::numeric, 2) AS min_time_ms,
round(max_exec_time::numeric, 2) AS max_time_ms,
calls, rows
FROM pg_stat_statements
WHERE rows >= 0
AND query NOT LIKE 'SELECT%' -- Focus on writes
ORDER BY total_exec_time DESC
LIMIT 50;
Métrica más relevante: exec_time_per_1000_rows_ms — porque el RdbmsExporter batchea inserts/updates en un solo statement. Tiempo por 1000 filas es más significativo que tiempo por call.
Estadísticas de tablas¶
Tamaños¶
SELECT tablename,
pg_size_pretty(pg_relation_size(...)) AS table_size,
pg_size_pretty(pg_indexes_size(...)) AS indexes_size,
pg_size_pretty(pg_total_relation_size(...)) AS total_size
FROM pg_tables
WHERE schemaname = 'public'
ORDER BY pg_total_relation_size(...)::bigint DESC;
Identifica tablas con índices desproporcionados (potencial problema).
Dead tuples y vacuum¶
SELECT schemaname, relname AS tablename,
n_dead_tup AS dead_tuples,
n_live_tup AS live_tuples,
round((100.0 * n_dead_tup / NULLIF(n_live_tup + n_dead_tup, 0))::numeric, 2) AS dead_tuple_percent,
last_vacuum, last_autovacuum, last_analyze, last_autoanalyze
FROM pg_stat_user_tables
WHERE schemaname = 'public' AND n_live_tup > 0
ORDER BY dead_tuple_percent DESC NULLS LAST
LIMIT 30;
Regla oficial: dead_tuple_percent > 5% → tabla necesita vacuum más frecuente.
Cache hit ratio (tablas)¶
SELECT schemaname, relname AS tablename,
heap_blks_read AS disk_reads,
heap_blks_hit AS cache_hits,
round((100.0 * heap_blks_hit / NULLIF(heap_blks_hit + heap_blks_read, 0))::numeric, 2) AS cache_hit_ratio_percent,
pg_size_pretty(pg_relation_size(schemaname||'.'||relname)) AS table_size
FROM pg_statio_user_tables
WHERE schemaname = 'public' AND (heap_blks_hit + heap_blks_read) > 1000
ORDER BY cache_hit_ratio_percent ASC NULLS LAST;
Regla oficial: cache_hit_ratio < 90% → más memoria o índices necesarios.
Estadísticas de índices¶
Cache hit ratio (índices)¶
SELECT schemaname, relname, indexrelname,
idx_blks_read, idx_blks_hit,
round((100.0 * idx_blks_hit / NULLIF(idx_blks_hit + idx_blks_read, 0))::numeric, 2) AS cache_hit_ratio_percent
FROM pg_statio_user_indexes
WHERE schemaname = 'public' AND (idx_blks_hit + idx_blks_read) > 1000
ORDER BY cache_hit_ratio_percent ASC NULLS LAST
LIMIT 30;
Detección de índices no usados¶
Crítico: índices no usados degradan write performance:
SELECT schemaname, relname, indexrelname,
idx_scan AS number_of_scans,
idx_tup_read AS tuples_read,
idx_tup_fetch AS tuples_fetched,
pg_size_pretty(pg_relation_size(indexrelid)) AS index_size,
CASE
WHEN idx_scan = 0 THEN 'UNUSED - Consider dropping'
WHEN idx_scan < 50 THEN 'RARELY USED - Review necessity'
ELSE 'ACTIVELY USED'
END AS usage_status
FROM pg_stat_user_indexes
WHERE schemaname = 'public'
ORDER BY idx_scan ASC, pg_relation_size(indexrelid) DESC;
Acciones:
- idx_scan = 0: candidato a DROP
- tuples_read >> tuples_fetched: índice no es selectivo
Análisis de WAL¶
SELECT wal_records,
wal_fpi AS full_page_images,
round((wal_bytes / 1024.0 / 1024.0 / 1024.0 /
NULLIF(EXTRACT(EPOCH FROM (now() - stats_reset)) / 3600.0, 0))::numeric, 2) AS wal_gb_per_hour,
wal_buffers_full,
stats_reset
FROM pg_stat_wal;
Métricas críticas:
- wal_buffers_full: idealmente 0. Si > 0, aumentar wal_buffers (default 16MB)
- wal_gb_per_hour: GB de WAL por hora — capacity planning de disco
Análisis de queries específicas¶
Detecta: índices faltantes, query plans subóptimos, I/O excesivo.
Implicaciones para el MVP¶
Estas queries son reusables directamente para un MVP basado en PostgreSQL:
- Dashboard de monitoring desde día 1: las queries son la fuente de métricas
- Reglas operacionales claras:
- Cache hit ratio > 90% (tablas e índices)
- Dead tuples < 5%
- wal_buffers_full = 0
- pg_stat_statements como métrica primaria — más útil que profilers a nivel app
- Batch operations: optimizar para
exec_time_per_1000_rows_ms, no por call - Detección de índices no usados: alerta automática para eliminar índices que degraden writes
- Capacity planning: tracking de
wal_gb_per_hourpara dimensionar discos
Las queries deben ser parte del runbook operacional del MVP, idealmente integradas en Grafana con thresholds.