Monitor "autovacuum queue" on Postgres primary
The query https://gitlab.com/snippets/1889668 can be used to monitor the "autovacuum queue": the list of tables that need to be processed by autovacuum but not yet being processed + size of this list.
Goals:
- be able to see if some tables that need processing don't get it and "wait" (we could even use some metric that is similar to "load average" and have some alerts: if we have N workers, and 2*N tables are "waiting" to be processed, it is time to trigger an alert). Here we need to know only the size of the "queue". @emanuel_ongres do you think it is easy to add it to Prometheus?
- in case if troubleshooting is needed, it would be great to see details in logs. I usually achieve this using some cronjob with plpgsql form of that query, running every N minutes and logging details. What is the best option to do it in the case of GitLab.com infrastructure?
I think we should split the achieving of these two goals. Having item 1 is much more important. However, having item 2 can be extremely helpful as well, for the troubleshooting of autovacuum behavior.
Running this query (https://gitlab.com/snippets/1889668) makes sense only for the Postgres master. So we can wrap it to pg_is_in_recovery()
check, not to run it on replicas, but have it everywhere.
Edited by Nikolay Samokhvalov