PostgreSQL extension logerrors for better observability (error counts)

In Postgres, there is a problem: it's not easy to answer the question "How many errors of various types do we have?". To understand it, usually we need to process logs, which is inconvenient and fragile.

There is a small extension, created by guys from Yandex.Cloud and battle-tested there (thousands of Postgres clusters): logerrors.

I propose considering it (of course, after thorough testing), perhaps gathering metrics with Prometheus.

I've checked it for Postgres 11 on Ubuntu, here is a basic example:

test=# select pg_log_errors_reset();
 pg_log_errors_reset
---------------------

(1 row)

test=# blabla;
ERROR:  syntax error at or near "blabla"
LINE 1: blabla;
        ^
test=# select 1/0;
ERROR:  division by zero
test=# select 1/0;
ERROR:  division by zero
test=# create extension dummy;
ERROR:  could not open extension control file "/usr/share/postgresql/11/extension/dummy.control": No such file or directory
test=#
test=# select * from pg_log_errors_stats();
 time_interval |  type   |         message          | count | username | database
---------------+---------+--------------------------+-------+----------+----------
               | WARNING | TOTAL                    |     0 |          |
               | ERROR   | TOTAL                    |     4 |          |
               | FATAL   | TOTAL                    |     0 |          |
           600 | ERROR   | ERRCODE_SYNTAX_ERROR     |     1 | postgres | test
           600 | ERROR   | ERRCODE_DIVISION_BY_ZERO |     2 | postgres | test
           600 | ERROR   | ERRCODE_UNDEFINED_FILE   |     1 | postgres | test
(6 rows)

message values starting with ERRCODE_ correspond to Postgres standard error codes: https://www.postgresql.org/docs/11/errcodes-appendix.html.

With this tiny extension, we could better monitor the database health, including both system errors (this includes data_corrupted and index_corrupted codes so we could benefit from using it and the ongoing effort on turning on data checksums) and application-level issues.

cc @Finotto @gitlab-com/gl-infra/sre-datastores

Edited Nov 11, 2020 by Nikolay Samokhvalov