Postgres query analysis: extend methodology to see which queries are the biggest consumers for concrete resources (CPU, disk IO, memory IO)
Recently, together with @Finotto, we've had a few sessions of SQL query analysis where we used an ad-hoc tool, PASH Viewer, to understand the SQL workload better (here is an example of the analysis, and here is one more, and one more here). The usual tool, pg_stat_statements ("pgss"), was not enough, and here I'm going to explain why and raise the question of what we need to put in place permanently to achieve better observability.
It turned out that we didn't explain well enough that we were applying a new methodology, different from the one we usually relied on – the approach based on pg_stat_statements.
The problem (why pg_stat_statements is not enough sometimes)
The classic approach based on pg_stat_statements is good when we aim to keep the SQL workload under control and scale better. Usually, we considered total_time the main metric when aiming to reduce the load on Postgres nodes, took the Top-N queries, and analyzed them starting from position 1. If the query at position 1 was responsible for more than 40% of overall total_time, this was considered a very bad situation requiring attention sooner rather than later.
This approach was used in postgres-checkup reports, where we had more detailed analysis but for a specific period of time (mid-day UTC each Monday), and in Prometheus, where we had fewer details but a better historical view since the pgss data is being continuously collected there.
A couple of weeks ago, @Finotto pointed out that CPU load on the primary is sometimes 50%, with spikes to 70% (later to 80%), which did not look good. So we set a goal to identify the root cause of the increased CPU load, especially the spikes.
If we just use pgss's data, we cannot see what caused CPU load – query timing consists of multiple parts such as:
- CPU
- IO (pgss can provide IO timing for reading data blocks into the Postgres buffer pool, but it cannot distinguish reads served from the OS page cache from actual disk reads)
- waiting for a lock to be acquired
- communicating with clients
Why should we care and why do we need to "zoom in" on this? If our current primary goal is the analysis of CPU load, then we need to exclude the other parts. For example, if a query's timing increased due to waiting for a lock to be acquired, this doesn't lead to resource consumption and doesn't put load on the CPU. But this waiting period is reflected in the timing (particularly, in pgss' total_time). Similarly, as @ahachete absolutely correctly pointed out, during IO the CPU is not actually used – so if, in a particular activity, the primary goal is to reduce CPU load, we need to subtract IO timing from total_time.
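To illustrate how far pgss alone can take us, here is a rough sketch (assuming Postgres 11 column names total_time, blk_read_time, blk_write_time, and track_io_timing = on; otherwise the IO columns stay at zero). Note that the remainder after subtracting buffer IO still mixes CPU, lock waits, and client communication:

```sql
-- Top query groups by "time that is not buffer IO" – only a rough proxy
-- for CPU, because lock waits and client communication are still included.
SELECT
    queryid,
    left(query, 60)                                      AS query_sample,
    round(total_time::numeric, 1)                        AS total_ms,
    round((blk_read_time + blk_write_time)::numeric, 1)  AS buffer_io_ms,
    round((total_time - blk_read_time - blk_write_time)::numeric, 1)
                                                         AS non_io_ms
FROM pg_stat_statements
ORDER BY non_io_ms DESC
LIMIT 10;
```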
Therefore, we need something else in addition to pg_stat_statements.
What are the options?
There are two options that can give us an understanding of which queries consume a concrete resource (say, CPU or disk IO) the most:
- Extend pg_stat_statements with pg_stat_kcache.
- Add an alternative approach based on sampling the queries in pg_stat_activity, aggregating them (similarly to what pgss does), and paying attention to the wait_event and wait_event_type columns, which tell us what exactly the backend executing the query was doing when we looked at it.
1) Extending pgss with pg_stat_kcache
URL: https://github.com/powa-team/pg_stat_kcache
It's a fairly well-known extension developed by the PoWA team. It's definitely capable of providing useful info: CPU load and disk IO. However, it won't give info about the time spent communicating with the client or waiting for a lock to be acquired (the "blocked" state) – these things can be understood only indirectly (by subtracting CPU and disk IO timing from total_time).
It's also used in heavily loaded systems – I personally have had good experience with it.
It also can show the number of context switches associated with a particular SQL query group. The full list of details it provides: https://github.com/powa-team/pg_stat_kcache#pg_stat_kcache-view
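As an illustration, a sketch of how pgss and pg_stat_kcache can be combined to find the top CPU consumers. I'm assuming the pg_stat_kcache 2.1.x column set here (user_time and system_time in seconds, reads/writes as reported by the extension); newer releases prefix the counters with exec_/plan_, so check the view definition of the installed version before using this:

```sql
-- Top query groups by CPU time spent in user + kernel space.
-- pg_stat_kcache() returns per-(queryid, userid, dbid) rusage counters;
-- joining with pg_stat_statements adds the normalized query text.
SELECT
    s.queryid,
    left(s.query, 60)                                 AS query_sample,
    round(s.total_time::numeric, 1)                   AS total_ms,
    round((k.user_time + k.system_time)::numeric, 3)  AS cpu_seconds,
    k.reads                                           AS physical_reads,
    k.writes                                          AS physical_writes
FROM pg_stat_statements AS s
JOIN pg_stat_kcache() AS k USING (queryid, userid, dbid)
ORDER BY cpu_seconds DESC
LIMIT 10;
```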
2) Sampling pg_stat_activity and analyzing wait_event_type
This approach is used in RDS, and originates from the Oracle world. Moreover, RDS has a strong focus on this approach, providing all customers with the tool called Performance Insights.
The idea is quite simple: if we take a snapshot of pg_stat_activity every few seconds, then aggregate queries (removing parameters from them – similarly to what pgss does) and analyze wait_event_type, we can draw a picture like this (an example from RDS Performance Insights):
There are many wait_event values defined, see (for Postgres 11): https://www.postgresql.org/docs/11/monitoring-stats.html#WAIT-EVENT-TABLE. Those events are grouped into these types (wait_event_type). From the official documentation:
- LWLock: The backend is waiting for a lightweight lock. Each such lock protects a particular data structure in shared memory. wait_event will contain a name identifying the purpose of the lightweight lock. (Some locks have specific names; others are part of a group of locks each with a similar purpose.)
- Lock: The backend is waiting for a heavyweight lock. Heavyweight locks, also known as lock manager locks or simply locks, primarily protect SQL-visible objects such as tables. However, they are also used to ensure mutual exclusion for certain internal operations such as relation extension. wait_event will identify the type of lock awaited.
- BufferPin: The server process is waiting to access to a data buffer during a period when no other process can be examining that buffer. Buffer pin waits can be protracted if another process holds an open cursor which last read data from the buffer in question.
- Activity: The server process is idle. This is used by system processes waiting for activity in their main processing loop. wait_event will identify the specific wait point.
- Extension: The server process is waiting for activity in an extension module. This category is useful for modules to track custom waiting points.
- Client: The server process is waiting for some activity on a socket from user applications, and that the server expects something to happen that is independent from its internal processes. wait_event will identify the specific wait point.
- IPC: The server process is waiting for some activity from another process in the server. wait_event will identify the specific wait point.
- Timeout: The server process is waiting for a timeout to expire. wait_event will identify the specific wait point.
- IO: The server process is waiting for a IO to complete. wait_event will identify the specific wait point.
A similar analysis can be achieved for self-managed Postgres. Open-source tools for it:
- PASH Viewer – an "ad hoc" tool (the one we used), a Java app that connects to the database, reads from pg_stat_activity periodically, immediately performs the analysis described above, and draws pictures.
- pgCenter – also an "ad hoc" tool; it can sample pg_stat_activity and record the samples to a file, which can later be used for analysis. It doesn't draw anything, but has more capabilities and can be used as a lower-level tool in automation.
- pgsentinel – an extension that can be installed into Postgres to accumulate samples of pg_stat_activity. It uses pgss for tracking which queries get executed in your database. It doesn't visualize anything but can be used as a source of metrics for existing monitoring. Interesting fact: if pgsentinel is installed, PASH Viewer can detect it and, when launched, create a visualization for the past. I lack experience using it under heavy load, so it should be properly tested first.
- pg_wait_sampling – another extension for sampling. It should also be thoroughly tested under load first (there are reported locking issues: one, two). PoWA monitoring supports it (see the comments below).
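To give an idea of what the accumulated samples enable, here is a sketch of a query against pgsentinel's pg_active_session_history view. I'm assuming its ash_time, queryid, and wait_event_type columns as described in the extension's README – please verify against the installed version, since I haven't used it in production:

```sql
-- Approximate "DB time" per query group over the last hour, broken down
-- by wait type: each row of history represents roughly one sampling
-- interval during which a session was active.
SELECT
    queryid,
    coalesce(wait_event_type, 'CPU') AS wait_event_type,
    count(*)                         AS samples
FROM pg_active_session_history
WHERE ash_time >= now() - interval '1 hour'
GROUP BY 1, 2
ORDER BY samples DESC
LIMIT 20;
```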
Proposal
What I propose:
- Either extend the pg_stat_statements-based analysis with pg_stat_kcache, or add sampling of pg_stat_activity (installing pgsentinel). Or do both.
- Periodically analyze the Top-N queries/transactions/sessions that are the biggest consumers of CPU or disk IO, or that cause locking issues, and optimize them.
