Main PostgreSQL troubleshooting monitoring dashboard + runbook
PostgreSQL basic troubleshooting methodology and monitoring
For system performance troubleshooting, there are various popular methodologies, such as:
- USE https://www.brendangregg.com/usemethod.html
- RED https://grafana.com/blog/2018/08/02/the-red-method-how-to-instrument-your-services/
- Four golden signals https://sre.google/sre-book/monitoring-distributed-systems/
For databases, and Postgres specifically, it is good to use a combined approach, so that we have a straightforward methodology both for troubleshooting incidents and for reviewing Postgres performance:
- for system-level resources, use one of the common approaches
- additionally (and this is very important), apply some variation of those approaches to DB-level resources and entities – shared resources, locks at various levels, DB objects, query groups (normalized queries), etc.
There is good existing work describing this. For example, these two materials are great and I highly recommend checking them out:
- Alexey Lesovsky: "PostgreSQL, RED, Golden Signals: getting started" (2020) (blog post)
- Peter Zaitsev: "Performance Analysis and Troubleshooting Methodologies for Databases" (FOSDEM-2021 talk, video + slides)
In 2020, we had a series of community sessions reviewing various monitoring tools for Postgres; as a result, this doc was created: Postgres monitoring review checklist (2020)
And here is a simplified document describing only "dashboard #1" for PostgreSQL troubleshooting – something that any good monitoring should have; it is based on the materials above: PostgreSQL Troubleshooting & Monitoring – Checklist / Runbook
One of the key principles of "dashboard/runbook #1" is: we need a very wide, but not yet deep, overview of the key aspects of PostgreSQL health, allowing us to perform the analysis in a very short time, potentially during an incident -- identifying the specific topics that need deeper analysis.
Troubleshooting tools for GitLab.com
For GitLab.com, we have quite advanced observability tools, some of which are great and solve particular tasks really well. However, there is no well-designed "dashboard #1" and "runbook #1" for PostgreSQL troubleshooting. What we have now:
- Runbook for Postgres troubleshooting https://gitlab.com/gitlab-com/runbooks/-/blob/master/docs/patroni/postgres.md#postgresql-troubleshooting is quite unbalanced: it lacks some basic things like analysis of sessions by type or historical analysis of long-running sessions, but at the same time dives quite deeply into specific topics.
- Grafana dashboards: we have "PostgreSQL Overview" and "patroni overview", and both look unbalanced too – they mix system-level and PostgreSQL-level things, were developed quite long ago, and are missing important things: analysis of long-running transactions, a good view of session counts by state, WAL archive rates, analysis of autovacuum activity, etc. Here are screenshots showing what we currently have in both:
- Prometheus is also a mixed bag: sometimes great metrics are collected, but sometimes there are issues like this one: gitlab-cookbooks/gitlab-exporters!235 (merged)
All of this leads to a troubleshooting process that is far from ideal: we lose time hunting for important data, run manual checks, etc.
With more PostgreSQL clusters coming (due to vertical decomposition and sharding efforts), this will hit us more and more. We need to improve the tooling, have simple and straightforward instructions, and shift database operations towards treating Postgres nodes more like "cattle" than "pets" (that is, having a less deep but very simple, wide, and complete overview in "dashboard / runbook #1").
Proposal, step 1: basic troubleshooting runbook for Postgres
Based on our (Postgres.ai) experience and various materials, including those mentioned above, we think the main runbook (and therefore the main monitoring dashboard) has to include the following:
- TPS and (optional but also desired) QPS
- Latency (query duration) – at least average. Better: histogram, percentiles
- Connections (sessions) – stacked graph of session counts by state (first of all: active and idle-in-transaction; also interesting: idle and others) and how far the sum is from max_connections (+ pool size for PgBouncer).
- Longest transactions (max transaction age or top-n transactions by age), excluding autovacuum activity
- Commits vs rollbacks – how many transactions are rolled back
- Transactions left till transaction ID wraparound
- Replication lags / bytes in replication slot / unused replication slots
- Count of WALs waiting to be archived
- WAL generation rates
- Locks
- [advanced, but also very desirable] Basic query analysis graph (top-n by total_time or by mean_time?)
- [advanced, but also very desirable] Basic wait event analysis (a.k.a. "active session analysis" or "performance insights") – a sketch of both advanced items is included below
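To make this list more concrete, here is a minimal sketch of the kind of catalog queries that sit behind several of these panels (assuming Postgres 12+ and a role with pg_monitor privileges; a monitoring agent would sample these periodically and graph the deltas, but they are also useful as ad-hoc checks during an incident):

```sql
-- Minimal sketch (Postgres 12+): ad-hoc versions of several panels above.
-- A monitoring agent would sample these periodically and graph the deltas.

-- Sessions by state (compare the total against max_connections / PgBouncer pool size)
select state, count(*)
from pg_stat_activity
where backend_type = 'client backend'
group by 1
order by 2 desc;

-- Longest transactions; filtering to client backends excludes autovacuum workers
select pid, now() - xact_start as xact_age, state, left(query, 80) as query
from pg_stat_activity
where xact_start is not null
  and backend_type = 'client backend'
order by xact_start
limit 10;

-- Commits vs rollbacks (rates are deltas between two samples)
select datname, xact_commit, xact_rollback
from pg_stat_database
where datname = current_database();

-- Approximate headroom before transaction ID wraparound (~2.1B is the hard limit)
select datname,
       age(datfrozenxid)              as xid_age,
       2147483648 - age(datfrozenxid) as approx_xids_left
from pg_database
order by xid_age desc;

-- Replication lag in bytes and replication slots (run on the primary)
select client_addr,
       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) as replay_lag_bytes
from pg_stat_replication;

select slot_name, active,
       pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) as retained_wal_bytes
from pg_replication_slots;

-- WAL files waiting to be archived (pg_ls_archive_statusdir appeared in Postgres 12)
select count(*) as wal_files_pending_archive
from pg_ls_archive_statusdir()
where name like '%.ready';
```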
We can develop such a runbook, treat it as the entry point for troubleshooting, and restructure the existing runbooks to match it.
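For the two "advanced" items, a possible sketch, assuming the pg_stat_statements extension is installed (on Postgres 13+ the columns are total_exec_time / mean_exec_time instead of total_time / mean_time). Proper wait event history needs pg_wait_sampling or pgsentinel, but a coarse ad-hoc view can be built from pg_stat_activity alone:

```sql
-- Top query groups by total time since the last stats reset
-- (column names are for Postgres <= 12; use total_exec_time / mean_exec_time on 13+)
select queryid,
       calls,
       round(total_time::numeric, 1) as total_time_ms,
       round(mean_time::numeric, 2)  as mean_time_ms,
       left(query, 80)               as query
from pg_stat_statements
order by total_time desc
limit 20;

-- Coarse wait event snapshot from pg_stat_activity
-- (pg_wait_sampling / pgsentinel collect this continuously, with history);
-- NULL wait_event means the backend is running on CPU rather than waiting
select coalesce(wait_event_type, 'CPU') as wait_event_type,
       coalesce(wait_event, 'CPU')      as wait_event,
       count(*)
from pg_stat_activity
where state = 'active'
  and backend_type = 'client backend'
group by 1, 2
order by 3 desc;
```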
Proposal, step 2: better monitoring
Implementing a new runbook on top of the existing monitoring will be far from efficient. We need to catch up. This seems to be a big task that will require significant effort, considering the complexity of the current monitoring system:
- for the "dashboard #1" – ~1 month
- for going deeper and improving various other aspects of Postgres monitoring – several months
Meanwhile, during incidents we lack good tooling; what we have is far from the state of the art of existing PostgreSQL methodologies and tools (even just the open-source ones), which have advanced significantly over the last few years.
A fresh example is a series of incidents related to subtransactions – multiple incidents tracked in this "umbrella" issue https://gitlab.com/gitlab-org/gitlab/-/issues/334638. It took almost a month (!) to find the root cause.
That's why I'm proposing the following:
- We have a quick and efficient solution – pgwatch2, with a redesigned set of dashboards adjusted to the approach described here. Both pgwatch2 itself and our (Postgres.ai's) adjustments are open source (BSD3). We can install it on a separate machine, and it will query all Postgres nodes infrequently – every minute or every few minutes. The footprint is very low and has already been tested in some heavily loaded production environments. To be on the safe side, we could additionally research the footprint of those queries – checking what the produced queries look like using pg_stat_statements, applied to a copy of the GitLab.com DB without concurrent traffic – a quite simple and straightforward check (see the sketch after this list). Next, we could start using it for 1 replica only, for a few days – and then roll it out to all nodes. Immediate benefits:
- it's state-of-the-art monitoring for Postgres, covering all major topics both in a simple "wide but not deep" manner on the main dashboard, and deeper in additional dashboards. Demo for a synthetic DB/workload: http://dev1.postgres.ai:3000 (user: read, pass: read)
- setup is simple: everything runs in Docker, the config lives in a separate Postgres DB, and everything is contained on a single machine, using InfluxDB and Grafana – so it's ideal for the quick-and-dirty "let's catch up" phase
- additionally, there is an Admin GUI that allows changing dashboards and charts quickly (I can provide admin access to the demo setup upon request) – good for iterating very quickly; at the same time, the configuration can also be stored in files -- to allow a GitOps workflow for changes
- it's fully compatible with the ideas for good query analysis I raised in https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/12197, including pg_stat_statements, pg_stat_kcache (physical-level metrics for query analysis, such as CPU, context switches, and real disk IO), and pg_wait_sampling (an alternative to pgsentinel) for wait event analysis (a.k.a. active session history).
- if there is a consensus, we can have it running in 1-2 weeks.
- Next, we catch up in the existing, main monitoring – implement the new main dashboard and adjust the other dashboards to cover everything that is missing. As I mentioned above, this will require more effort and will take a few months.
- Once we are confident that the main monitoring is developed enough, we can sunset pgwatch2 – or keep it if we see additional benefits from having it. Living with 2 (sometimes 3) monitoring systems for a long time is something I have observed more than once in quite large organizations.
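As for the footprint check mentioned in the first item of this list, one possible way to do it, assuming the agent connects with a dedicated role (the role name pgwatch2 below is an assumption): reset pg_stat_statements on the clone, let the agent run for a while with no other traffic, then look at what it executed and what it cost:

```sql
-- Sketch of the footprint check, on a clone of the GitLab.com DB with no concurrent traffic.
-- The monitoring role name 'pgwatch2' is an assumption; use whatever role the agent connects as.

-- 1) Reset statistics, then let the agent run for, say, an hour.
select pg_stat_statements_reset();

-- 2) Inspect what the agent executed and how expensive it was.
select s.calls,
       round(s.total_time::numeric, 1) as total_time_ms,   -- total_exec_time on Postgres 13+
       s.rows,
       s.shared_blks_hit + s.shared_blks_read as shared_blks_accessed,
       left(s.query, 120) as query
from pg_stat_statements s
join pg_roles r on r.oid = s.userid
where r.rolname = 'pgwatch2'
order by s.total_time desc
limit 50;
```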
I'm very sure: this is a core, very basic need for scaling PostgreSQL and the related processes.