PostgreSQL, especially in versions 9.4 and later, comes with many system views, functions and features that expose rich information about the internal workings of your PostgreSQL cluster.

Traditionally, there have been scripts – like the venerable check_postgres and others, usually passed on from DBA to DBA and shared on public and private wikis – that collect or check some of this information.

There hasn’t been a standard, easy way to collect all the information available from your PostgreSQL server in a form that DBAs can use for troubleshooting and that ops folks can use in scripting for automation and monitoring.

pgmetrics

pgmetrics aims to be the simplest way to collect comprehensive information and statistics from a running PostgreSQL cluster. It can display this information in a human-readable text format for troubleshooting and diagnosis, or export it as a JSON file for scripting, automation and monitoring.

We built pgmetrics based on our interactions with the customers of our server, database and service monitoring product OpsDash, as well as from our own needs arising from using PostgreSQL in production.

pgmetrics is open-source (Apache 2.0 licensed), is available now and lives at pgmetrics.io.

Usage

pgmetrics is a single, dependency-free, statically-linked executable that can be easily deployed to any server or machine by just copying it there. It can be invoked just like psql, and takes nearly the same set of command-line parameters and environment variables that you’re used to:

$ pgmetrics --help
pgmetrics collects PostgreSQL information and metrics.

Usage:
  pgmetrics [OPTION]... [DBNAME]

General options:
  -t, --timeout=SECS           individual query timeout in seconds (default: 5)
  -S, --no-sizes               don't collect tablespace and relation sizes
  -i, --input=FILE             don't connect to db, instead read and display
                               this previously saved JSON file
  -V, --version                output version information, then exit
  -?, --help[=options]         show this help, then exit
      --help=variables         list environment variables, then exit

Output options:
  -f, --format=FORMAT          output format; "human", or "json" (default: "human")
  -l, --toolong=SECS           for human output, transactions running longer than
                               this are considered too long (default: 60)
  -o, --output=FILE            write output to the specified file
      --no-pager               do not invoke the pager for tty output

Connection options:
  -h, --host=HOSTNAME          database server host or socket directory
                               (default: "/var/run/postgresql")
  -p, --port=PORT              database server port (default: 5432)
  -U, --username=USERNAME      database user name (default: "vagrant")
      --no-password            never prompt for password

For more information, visit <https://pgmetrics.io>.

For example, to collect metrics from databases called “main” and “reports” from a server, you might use:

$ pgmetrics -h mypgserver -U alice main reports
Password:
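With -f json, the same run produces machine-readable output that is easy to consume from scripts. As a sketch, a monitoring check might pull a couple of values out of a saved report like this (the field names below are illustrative assumptions, not the documented pgmetrics JSON schema – check the actual output of your pgmetrics version):

```python
import json

# Hypothetical excerpt of a pgmetrics JSON report; real field names may differ.
report = json.loads("""
{
  "version": "10.2",
  "active_backends": 5,
  "max_connections": 100
}
""")

# Alert if more than 80% of the connection slots are in use.
used = report["active_backends"] / report["max_connections"]
if used > 0.8:
    print(f"WARNING: {used:.0%} of connection slots in use")
else:
    print(f"ok: {used:.0%} of connection slots in use")
```

The same pattern works for any metric in the report: save with `pgmetrics -f json -o report.json`, then load and inspect the file.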

The sections below show the information and metrics that pgmetrics can collect and display:

Overall Status

Basic, overall information about the server includes the last checkpoint and transaction times, checkpoint lag, active backend count and transaction ID age. The age of the oldest transaction ID must stay well below 2 billion to avoid transaction ID wraparound issues.

PostgreSQL Cluster:
    Name:                staging1
    Server Version:      10.2
    Server Started:      5 Mar 2018 3:35:49 AM (50 minutes ago)
    System Identifier:   6529298009807657133
    Timeline:            1
    Last Checkpoint:     5 Mar 2018 4:25:28 AM (49 seconds ago)
    Prior LSN:           0/2808EF98
    REDO LSN:            0/29882FC0 (24 MiB since Prior)
    Checkpoint LSN:      0/2BDB92C0 (37 MiB since REDO)
    Transaction IDs:     548 to 352506 (diff = 351958)
    Last Transaction:    5 Mar 2018 4:26:17 AM (now)
    Active Backends:     5 (max 100)
    Recovery Mode?       no
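The transaction ID age is worth keeping an eye on: once roughly 2^31 IDs (about 2.1 billion) are "in the past", wraparound protection kicks in. pgmetrics reports the numbers for you; the arithmetic behind the check, using the values above, is just:

```python
# PostgreSQL transaction IDs are 32-bit; at most ~2^31 can be in the past
# before wraparound becomes a danger.
WRAP_LIMIT = 2**31

oldest_xid = 548
current_xid = 352506
age = current_xid - oldest_xid      # matches "diff = 351958" in the report

print(f"xid age: {age} ({age / WRAP_LIMIT:.4%} of the wraparound limit)")
```

On this freshly-started cluster the age is a tiny fraction of the limit; in production, alerting well before the limit (say, at 50%) leaves ample time for vacuuming to catch up.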

Replication Status

When run on a primary with outgoing streaming replication, pgmetrics displays the state and progress of each replication connection – the lag, in bytes, between the current state and the state that has been received, flushed and replayed at the destination.

Outgoing Replication Stats:
    Destination #1:
      User:              vagrant
      Application:       pg_receivewal
      Client Address:
      State:             streaming
      Started At:        5 Mar 2018 3:51:20 AM (34 minutes ago)
      Sent LSN:          0/2CC2E000
      Written Until:     0/2CB88000 (write lag = 664 KiB)
      Flushed Until:     0/2C000000 (flush lag = 12 MiB)
      Replayed Until:
      Sync Priority:     0
      Sync State:        async
    Destination #2:
      User:              vagrant
      Application:       pg_recvlogical
      Client Address:
      State:             streaming
      Started At:        5 Mar 2018 3:52:22 AM (33 minutes ago)
      Sent LSN:          0/2CC2DF60
      Written Until:     0/2CC11FF0 (write lag = 112 KiB)
      Flushed Until:     0/2CC11FF0 (no flush lag)
      Replayed Until:
      Sync Priority:     0
      Sync State:        async
    Destination #3:
      User:              vagrant
      Application:       walreceiver
      Client Address:
      State:             streaming
      Started At:        5 Mar 2018 3:55:26 AM (30 minutes ago)
      Sent LSN:          0/2CC2E000
      Written Until:     0/2CC2E000 (no write lag)
      Flushed Until:     0/2CC2E000 (no flush lag)
      Replayed Until:    0/2CC2DF60 (replay lag = 160 B)
      Sync Priority:     0
      Sync State:        async
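The lag figures are byte differences between LSNs. An LSN like 0/2CC2E000 is simply a 64-bit WAL position written as two hex numbers (high 32 bits / low 32 bits), so the conversion and subtraction are straightforward; a minimal sketch:

```python
def lsn_to_bytes(lsn: str) -> int:
    """Convert a PostgreSQL LSN like '0/2CC2E000' to an absolute byte offset."""
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) + int(lo, 16)

# Destination #1 from the report above:
sent    = lsn_to_bytes("0/2CC2E000")
written = lsn_to_bytes("0/2CB88000")
print(f"write lag: {(sent - written) // 1024} KiB")   # -> 664 KiB
```

On the server itself, `pg_wal_lsn_diff()` does the same subtraction in SQL.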

Replication Slots

On the master side, a list of replication slots, along with their progress information, is captured.

Physical Replication Slots:
+------------+--------+---------------+-------------+-----------+
|    Name    | Active | Oldest Txn ID | Restart LSN | Temporary |
+------------+--------+---------------+-------------+-----------+
| slave_slot | yes    |               | 0/2CC2E000  | no        |
| wal_arch   | yes    |               | 0/2C000000  | no        |
+------------+--------+---------------+-------------+-----------+

Logical Replication Slots:
+----------+---------------+----------+--------+---------------+-------------+---------------+-----------+
|   Name   |    Plugin     | Database | Active | Oldest Txn ID | Restart LSN | Flushed Until | Temporary |
+----------+---------------+----------+--------+---------------+-------------+---------------+-----------+
| logslot1 | test_decoding | bench    | yes    |               | 0/28510670  | 0/2CC11FF0    | no        |
+----------+---------------+----------+--------+---------------+-------------+---------------+-----------+

Standby Replication Status

When pgmetrics is run on a hot standby, it collects the recovery and replication status. This shows how much data has been received and replayed at the standby end. If replication slots are used, richer data is available, as seen below:

Recovery Status:
    Replay paused:       no
    Received LSN:        0/2EA4A000
    Replayed LSN:        0/2EA49FB8 (lag = 72 B)
    Last Replayed Txn:   5 Mar 2018 4:26:42 AM (now)

Incoming Replication Stats:
    Status:              streaming
    Received LSN:        0/2EA4A000 (started at 0/11000000, 474 MiB)
    Timeline:            1 (was 1 at start)
    Latency:             11.505ms
    Replication Slot:    slave_slot

WAL Archiving

When WAL archiving is enabled, these stats are collected and displayed. The number of WAL files in the pg_wal directory (pg_xlog in older PostgreSQL versions), as well as the number of “ready” files, is also collected.

Relevant configuration settings are also displayed in this section. pgmetrics collects all such settings, including any changes from their defaults.

WAL Files:
    WAL Archiving?       yes
    WAL Files:           17
    Ready Files:         0
    Archive Rate:        0.91 per min
    Last Archived:       5 Mar 2018 4:26:05 AM (12 seconds ago)
    Last Failure:
    Totals:              46 succeeded, 0 failed
    Totals Since:        5 Mar 2018 3:35:50 AM (50 minutes ago)
+--------------------+---------------+
|      Setting       |     Value     |
+--------------------+---------------+
| wal_level          | logical       |
| archive_timeout    | 120           |
| wal_compression    | on            |
| max_wal_size       | 1024 (16 GiB) |
| min_wal_size       | 80 (1.3 GiB)  |
| checkpoint_timeout | 60            |
| full_page_writes   | on            |
| wal_keep_segments  | 10            |
+--------------------+---------------+
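The archive rate is just successful archivals divided by the elapsed window. With the totals above (46 succeeded over roughly 50 minutes; the exact window differs by a few seconds, which is why pgmetrics shows 0.91 rather than exactly this figure), the arithmetic is:

```python
succeeded = 46
window_minutes = 50          # "Totals Since ... (50 minutes ago)", approximately

rate = succeeded / window_minutes
print(f"archive rate: {rate:.2f} per min")
```

A rate that falls while "ready" files pile up is the classic sign of a failing or slow archive_command.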

BG Writer

The stats for the bgwriter process include the distribution of scheduled and requested checkpoints, write frequency and amounts, buffer writes segregated by cause, and other information.

BG Writer:
    Checkpoint Rate:     1.05 per min
    Average Write:       12 MiB per checkpoint
    Total Checkpoints:   47 sched (88.7%) + 6 req (11.3%) = 53
    Total Write:         1.6 GiB, @ 540 KiB per sec
    Buffers Allocated:   116427 (910 MiB)
    Buffers Written:     85061 chkpt (41.6%) + 72964 bgw (35.7%) + 46451 be (22.7%)
    Clean Scan Stops:    174
    BE fsyncs:           0
    Counts Since:        5 Mar 2018 3:35:49 AM (50 minutes ago)
+------------------------------+--------------+
|           Setting            |    Value     |
+------------------------------+--------------+
| bgwriter_delay               | 200 msec     |
| bgwriter_flush_after         | 64 (512 KiB) |
| bgwriter_lru_maxpages        | 100          |
| bgwriter_lru_multiplier      | 2            |
| block_size                   | 8192         |
| checkpoint_timeout           | 60 sec       |
| checkpoint_completion_target | 0.5          |
+------------------------------+--------------+
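The "Buffers Written" line breaks down who actually wrote dirty buffers to disk: the checkpointer, the background writer, or the backends themselves (a high backend share suggests the bgwriter isn't keeping up). The percentages are each writer's share of the total; with the counts above:

```python
# Buffer write counts from the report above.
chkpt, bgw, be = 85061, 72964, 46451
total = chkpt + bgw + be

for name, n in [("chkpt", chkpt), ("bgw", bgw), ("be", be)]:
    print(f"{name}: {n / total:.1%}")
# -> chkpt: 41.6%, bgw: 35.7%, be: 22.7%
```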

Backends

The report about active backends highlights common causes for concern, like transactions that have been open for too long or are idling. Processes waiting for various reasons, including locks, are also called out:

Backends:
    Total Backends:      4 (4.0% of max 100)
    Problematic:         3 waiting, 2 xact too long, 2 idle in xact

  Waiting:
+------+---------+------+-------------+----------+---------------------+-----------------------+
| PID  |  User   | App  | Client Addr | Database |        Wait         |      Query Start      |
+------+---------+------+-------------+----------+---------------------+-----------------------+
| 7024 | vagrant | psql |             | postgres | Client / ClientRead | 5 Mar 2018 5:01:05 AM |
| 7210 | vagrant | psql |             | postgres | Lock / relation     | 5 Mar 2018 5:00:35 AM |
| 7213 | vagrant | psql |             | postgres | Client / ClientRead | 5 Mar 2018 5:00:26 AM |
+------+---------+------+-------------+----------+---------------------+-----------------------+

  Long Running (>60 sec) Transactions:
+------+---------+------+-------------+----------+--------------------------------------+
| PID  |  User   | App  | Client Addr | Database |          Transaction Start           |
+------+---------+------+-------------+----------+--------------------------------------+
| 7210 | vagrant | psql |             | postgres | 5 Mar 2018 4:59:57 AM (1 minute ago) |
| 7213 | vagrant | psql |             | postgres | 5 Mar 2018 5:00:10 AM (1 minute ago) |
+------+---------+------+-------------+----------+--------------------------------------+

  Idling in Transaction:
+------+---------+------+-------------+----------+----------+-----------------------+
| PID  |  User   | App  | Client Addr | Database | Aborted? |     State Change      |
+------+---------+------+-------------+----------+----------+-----------------------+
| 7024 | vagrant | psql |             | postgres | no       | 5 Mar 2018 5:01:05 AM |
| 7213 | vagrant | psql |             | postgres | no       | 5 Mar 2018 5:00:26 AM |
+------+---------+------+-------------+----------+----------+-----------------------+

Vacuum Progress

Information about ongoing vacuum and autovacuum jobs is also captured by pgmetrics. This is helpful in diagnosing “stuck” vacuum jobs, and in making an educated guess about when ongoing jobs will finish.

Vacuum Progress:
    Vacuum Process #1:
      Phase:             scanning heap
      Database:          postgres
      Table:
      Scan Progress:     15369 of 21589 (71.2% complete)
      Heap Blks Vac'ed:  15368 of 21589
      Idx Vac Cycles:    0
      Dead Tuples:       53
      Dead Tuples Max:   291
+------------------------------+----------------+
|           Setting            |     Value      |
+------------------------------+----------------+
| maintenance_work_mem         | 65536 (64 KiB) |
| autovacuum                   | on             |
| autovacuum_analyze_threshold | 50             |
| autovacuum_vacuum_threshold  | 50             |
| autovacuum_freeze_max_age    | 200000000      |
| autovacuum_max_workers       | 3              |
| autovacuum_naptime           | 60 sec         |
| vacuum_freeze_min_age        | 50000000       |
| vacuum_freeze_table_age      | 150000000      |
+------------------------------+----------------+
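The scan-progress percentage is simply heap blocks scanned over total heap blocks (the counters behind pg_stat_progress_vacuum). And if you sample the report twice, a naive ETA for the current phase falls out of the same numbers; a sketch (the second sample value here is hypothetical):

```python
scanned, total = 15369, 21589          # from the report above
print(f"{scanned / total:.1%} complete")   # -> 71.2% complete

def eta_seconds(scanned1: int, scanned2: int, total: int, dt: float) -> float:
    """Naive ETA from two samples of blocks scanned, taken dt seconds apart."""
    rate = (scanned2 - scanned1) / dt      # blocks per second
    return (total - scanned2) / rate if rate > 0 else float("inf")

# e.g. if a sample 60 s earlier showed 14000 blocks scanned:
print(f"ETA: {eta_seconds(14000, 15369, 21589, 60):.0f} s")
```

This is only an estimate for the scanning phase; index vacuum cycles can add substantial time on top.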

Roles

All the roles (users and groups) in the cluster, including group memberships and attributes like superuser and connection limit, are also captured by pgmetrics. The display is a bit too wide for this blog post, so we’re omitting it here.

Tablespaces

For each tablespace, pgmetrics collects the location and the size consumed (as reported by pg_tablespace_size). If run locally, it also examines the mounted filesystem where the tablespace is located and reports the disk and inode usage for that filesystem.

Tablespaces:
+------------+---------+-------------------------------+---------+----------------------------+-------------------------+
|    Name    |  Owner  |           Location            |  Size   |         Disk Used          |       Inode Used        |
+------------+---------+-------------------------------+---------+----------------------------+-------------------------+
| pg_default | vagrant | $PGDATA = /home/vagrant/data1 | 249 MiB | 3.4 GiB (39.3%) of 8.7 GiB | 59889 (10.3%) of 584064 |
| pg_global  | vagrant | $PGDATA = /home/vagrant/data1 | 573 KiB | 3.4 GiB (39.3%) of 8.7 GiB | 59889 (10.3%) of 584064 |
| s1         | vagrant | /dev/shm/s93                  | 3.5 MiB | 10 MiB (1.0%) of 1002 MiB  | 21 (0.0%) of 256561     |
+------------+---------+-------------------------------+---------+----------------------------+-------------------------+

Databases

The stats for each database include the commit ratio, cache efficiency, the age of the oldest active transaction ID, the size in bytes and so on. Also included are the list of installed extensions, the cache efficiency of sequence objects and the list of disabled triggers.

Database #1:
    Name:                postgres
    Owner:               vagrant
    Tablespace:          pg_default
    Connections:         3 (no max limit)
    Frozen Xid Age:      484454
    Transactions:        159 (97.5%) commits, 4 (2.5%) rollbacks
    Cache Hits:          99.2%
    Rows Changed:        ins 83.3%, upd 0.0%, del 16.7%
    Total Temp:          1.3 MiB in 1 files
    Problems:            0 deadlocks, 0 conflicts
    Totals Since:        5 Mar 2018 3:36:08 AM (1 hour ago)
    Size:                11 MiB

  Sequences:
+---------------+------------+
|   Sequence    | Cache Hits |
+---------------+------------+
| seqtest_a_seq | 50.0%      |
+---------------+------------+

  Installed Extensions:
+---------+---------+------------------------------+
|  Name   | Version |           Comment            |
+---------+---------+------------------------------+
| plpgsql | 1.0     | PL/pgSQL procedural language |
+---------+---------+------------------------------+

  Disabled Triggers:
+------+---------------+-----------+
| Name |     Table     | Procedure |
+------+---------------+-----------+
| tr1  | public.trtest | trigfn    |
+------+---------------+-----------+

Tables and Indexes

For each table, the last vacuum and analyze information, as well as estimates of live and dead rows, are collected. Stats like the percentage of updates that are HOT updates, cache efficiency for the table and its indexes, size, bloat, and the use of sequential and index scans are included.

Stats for each index associated with the table, like cache efficiency and rows fetched per scan, are also listed.

The bloat figure calculated by pgmetrics uses the query taken from check_postgres.

Table #2 in "bench":
    Name:                bench.public.pgbench_tellers
    Manual Vacuums:      2, last 40 minutes ago
    Manual Analyze:      1, last 51 minutes ago
    Auto Vacuums:        6, last 12 minutes ago
    Auto Analyze:        11, last 12 minutes ago
    Post-Analyze:        99.3% est. rows modified
    Row Estimate:        0.7% live of total 14709
    Rows Changed:        ins 0.0%, upd 75.8%, del 0.0%
    HOT Updates:         75.8% of all updates
    Seq Scans:           152045, 100.0 rows/scan
    Idx Scans:           332274, 1.0 rows/scan
    Cache Hits:          100.0% (idx=100.0%)
    Size:                4.3 MiB
    Bloat:               5.6 MiB (131.6%)
+----------------------+------------+--------+----------------+-------------------+
|        Index         | Cache Hits | Scans  | Rows Read/Scan | Rows Fetched/Scan |
+----------------------+------------+--------+----------------+-------------------+
| pgbench_tellers_pkey | 100.0%     | 332274 | 107.2          | 1.0               |
+----------------------+------------+--------+----------------+-------------------+
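Bloat, in this context, is the gap between what a table actually occupies on disk and what its live rows should need; the check_postgres query estimates the latter from column statistics. Expressed as a percentage of the expected size (a sketch of the usual definition – pgmetrics' exact presentation may differ slightly):

```python
def bloat_pct(actual_bytes: int, expected_bytes: int) -> float:
    """Bloat as a percentage of the expected (ideal) table size."""
    return 100.0 * (actual_bytes - expected_bytes) / expected_bytes

# e.g. a hypothetical table occupying 10 MiB whose live rows need only 4 MiB:
print(f"{bloat_pct(10 * 2**20, 4 * 2**20):.1f}% bloat")   # -> 150.0% bloat
```

Heavily-updated tables like pgbench_tellers above accumulate bloat quickly; a VACUUM FULL or a tool like pg_repack reclaims the space.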

System Information

Finally, pgmetrics also captures system-level information (if run locally, of course).

System Information:
    Hostname:            stretch
    CPU Cores:           2 x Intel(R) Core(TM) i5-3450 CPU @ 3.10GHz
    Load Average:        0.07
    Memory:              used=174 MiB, free=76 MiB, buff=40 MiB, cache=1.7 GiB
    Swap:                used=2.4 MiB, free=1020 MiB
+---------------------------------+-----------------+
|             Setting             |      Value      |
+---------------------------------+-----------------+
| shared_buffers                  | 16384 (128 MiB) |
| work_mem                        | 4096 (4.0 MiB)  |
| maintenance_work_mem            | 65536 (64 MiB)  |
| temp_buffers                    | 1024 (8.0 MiB)  |
| autovacuum_work_mem             | -1              |
| temp_file_limit                 | -1              |
| max_worker_processes            | 8               |
| autovacuum_max_workers          | 3               |
| max_parallel_workers_per_gather | 2               |
| effective_io_concurrency        | 1               |
+---------------------------------+-----------------+

Availability

pgmetrics is available as a single binary for Linux, Windows and macOS, for 64-bit platforms. You can download these from the GitHub releases page. You can also easily build pgmetrics for other platforms yourself using a Go development environment; read more here.

System metrics (disk space, memory usage etc.) are currently supported only for Linux servers.

pgmetrics currently runs on PostgreSQL versions from 9.3 to 10. Patches to make it work on other versions are welcome.

Get Involved!

We’d be happy to hear your bug reports, suggestions and feedback; and incorporate them to make pgmetrics even more useful. Find out more at the pgmetrics home page at pgmetrics.io.