在 云原生数据库 PolarDB(基于PostgreSQL引擎)的日常运维中,数据库管理员(DBA)需要借助一系列系统视图和统计函数来掌握集群运行状态。本文将梳理从数据库全局统计到复制延迟、存储空间、事务健康度等维度的核心监控SQL,助力高效运维。
数据库级别活动统计
此部分SQL用于监控数据库实例的整体活动情况,包括事务提交回滚、数据块读写、元数据操作等关键指标。
-- 查看数据库级别统计
SELECT datname, xact_commit, xact_rollback, blks_read, blks_hit,
tup_returned, tup_fetched, tup_inserted, tup_updated, tup_deleted,
deadlocks
FROM pg_stat_database
WHERE datname NOT IN ('template0', 'template1');
-- 查看数据库冲突统计(适用于只读实例)
SELECT datname, confl_tablespace, confl_lock, confl_snapshot,
confl_bufferpin, confl_deadlock
FROM pg_stat_database_conflicts;
复制与流复制状态监控
监控主从复制链路的状态与延迟是保障高可用性的关键。
-- 主库:查看流复制状态
SELECT pid, state, client_addr, sync_state,
pg_wal_lsn_diff(pg_current_wal_lsn(), sent_lsn) as send_lag,
pg_wal_lsn_diff(pg_current_wal_lsn(), write_lsn) as write_lag,
pg_wal_lsn_diff(pg_current_wal_lsn(), flush_lsn) as flush_lag,
pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) as replay_lag
FROM pg_stat_replication;
-- 从库:查看 WAL 接收状态
SELECT status, receive_start_lsn, receive_start_tli,
pg_wal_lsn_diff(pg_last_wal_receive_lsn(), receive_start_lsn) as received_bytes,
pg_wal_lsn_diff(pg_last_wal_replay_lsn(), receive_start_lsn) as replayed_bytes,
pg_last_wal_receive_time, pg_last_wal_replay_time
FROM pg_stat_wal_receiver;
-- 计算复制延迟(字节)
SELECT pg_wal_lsn_diff(pg_current_wal_lsn(), pg_last_wal_replay_lsn()) AS replication_lag_bytes;
长时间运行操作进度报告
对于VACUUM、创建索引等可能长时间运行的操作,可以通过进度视图实时了解其完成情况。
-- VACUUM 进度
SELECT pid, datname, relname, phase, heap_blks_total, heap_blks_scanned,
heap_blks_vacuumed, index_vacuum_count, max_dead_tuples,
num_dead_tuples
FROM pg_stat_progress_vacuum;
-- CREATE INDEX/REINDEX 进度
SELECT pid, datname, relname, index_relname, phase,
blocks_total, blocks_done, tuples_total, tuples_done,
partitions_total, partitions_done
FROM pg_stat_progress_create_index;
-- ANALYZE 进度
SELECT pid, datname, relname, phase,
sample_blks_total, sample_blks_scanned,
ext_stats_total, ext_stats_computed
FROM pg_stat_progress_analyze;
系统级后台进程统计
监控检查点、WAL写入、归档等后台进程的活动,有助于评估系统底层性能。
-- 后台写入器统计
SELECT checkpoints_timed, checkpoints_req, checkpoint_write_time,
checkpoint_sync_time, buffers_checkpoint, buffers_clean,
maxwritten_clean, buffers_backend, buffers_backend_fsync,
buffers_alloc
FROM pg_stat_bgwriter;
-- WAL 统计
SELECT wal_records, wal_fpi, wal_bytes, wal_buffers_full,
wal_write_time, wal_sync_time, wal_write_lag,
wal_sync_lag, wal_replay_lag
FROM pg_stat_wal;
-- 归档进程统计
SELECT archived_count, failed_count, last_archived_wal,
last_archived_time, last_failed_wal, last_failed_time,
stats_reset
FROM pg_stat_archiver;
存储空间使用监控
定期检查数据库、表、索引的大小增长趋势,是容量规划与空间清理的基础。
-- 查看各数据库大小
SELECT pg_database.datname,
pg_size_pretty(pg_database_size(pg_database.datname)) AS size
FROM pg_database;
-- 查看表大小排序(Top 20)
SELECT schemaname, tablename,
pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) AS total_size,
pg_size_pretty(pg_relation_size(schemaname||'.'||tablename)) AS table_size,
pg_size_pretty(pg_indexes_size(schemaname||'.'||tablename)) AS index_size
FROM pg_tables
WHERE schemaname NOT IN ('information_schema', 'pg_catalog')
ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC
LIMIT 20;
-- WAL 日志大小检查
SELECT pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), '0/0')) AS current_wal_size;
事务年龄与长事务监控
监控事务ID年龄是预防“事务ID回卷”这一严重故障的核心手段,同时需要关注长事务对系统的影响。
-- 查看持有事务ID的长事务
SELECT pid, age(backend_xid) AS xid_age, age(backend_xmin) AS xmin_age,
query, state, backend_start, xact_start, query_start
FROM pg_stat_activity
WHERE backend_xid IS NOT NULL OR backend_xmin IS NOT NULL
ORDER BY greatest(age(backend_xid), age(backend_xmin)) DESC;
-- 查看各数据库的事务年龄
SELECT datname, age(datfrozenxid) AS frozenxid_age,
pg_size_pretty(pg_database_size(datname)) AS size
FROM pg_database
ORDER BY age(datfrozenxid) DESC;
-- 检查已准备但未提交的事务(两阶段提交)
SELECT transaction, gid, owner, database, prepared
FROM pg_prepared_xacts
WHERE age(transaction) > 1000000;
SLRU缓存与数据库缓存命中率
SLRU(简单LRU)缓存用于管理事务提交日志等结构,其命中率及数据库缓冲区的命中率是性能调优的重要参考。
-- SLRU 缓存统计
SELECT name, blks_zeroed, blks_hit, blks_read, blks_written,
blks_exists, flushes, truncates, stats_reset
FROM pg_stat_slru;
-- 计算各数据库的缓存命中率
SELECT datname,
blks_hit::float / (blks_hit + blks_read) * 100 AS cache_hit_ratio,
blks_read, blks_hit
FROM pg_stat_database
WHERE blks_hit + blks_read > 0
ORDER BY cache_hit_ratio;
连接会话与等待事件分析
分析连接来源、会话状态及等待事件,是进行连接池优化和排查性能瓶颈的常用方法。
-- 查看空闲事务(可能持有锁)
SELECT pid, state, backend_start, query_start,
age(clock_timestamp(), query_start) AS idle_duration,
query
FROM pg_stat_activity
WHERE state = 'idle in transaction'
ORDER BY query_start;
关键运维注意事项
- 事务年龄:务必定期检查,避免其接近20亿的事务ID回卷限制,必要时需执行
VACUUM FREEZE。
- 复制延迟:监控时建议同时关注字节延迟(
replication_lag_bytes)和时间延迟(replay_lag),以全面评估影响。
- 进度报告:
pg_stat_progress_* 视图仅在对应操作执行期间有数据,操作结束即清空。
- 缓存命中率:若数据库级缓存命中率持续低于95%,可考虑适当调大
shared_buffers参数。
- 操作监控:对于预计运行时间较长的
VACUUM或CREATE INDEX操作,可通过进度视图实时跟踪。
- WAL管理:若WAL日志大小增长过快,需检查并调整
wal_keep_size参数或归档任务的策略与效率。