Monitor and alert on corruption-related error messages in Postgres logs
(related discussions: https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/11875)
Error messages in Postgres logs carry error codes (SQLSTATE) listed in this table: https://www.postgresql.org/docs/current/errcodes-appendix.html
It is worth having a special alert on all these:
| Class XX — Internal Error | |
|---|---|
| XX000 | internal_error |
| XX001 | data_corrupted |
| XX002 | index_corrupted |
Reasoning:
- When corruption happens, it should be investigated as early as possible
- Without such alerts, a serious issue can go unnoticed. For example, during the PG 12->14 upgrade for gprd-ci, we had an incident (https://gitlab.com/gitlab-com/gl-infra/production/-/issues/15925) that was noticed only because the logs were inspected manually; it could easily have been overlooked
- On the other hand, if a "silence" is put on such an alert, it would go unnoticed as well. How should we approach this problem?
- I think even a single error with such a code should be analyzed.
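
To make the idea concrete, here is a minimal sketch of such a check in Python. It is not tied to any existing tooling. It assumes the SQLSTATE appears as a standalone token in each log line, which in plain-text logs requires `%e` in `log_line_prefix`. The function names, the threshold parameter, and the sample log lines are all hypothetical.

```python
import re
from collections import Counter

# SQLSTATE codes from Class XX (Internal Error) that should trigger an alert.
ALERT_CODES = {
    "XX000": "internal_error",
    "XX001": "data_corrupted",
    "XX002": "index_corrupted",
}

# Assumption: the SQLSTATE shows up as a standalone token in the log line,
# e.g. because log_line_prefix contains %e.
SQLSTATE_RE = re.compile(r"\b(XX00[0-2])\b")


def count_corruption_errors(lines):
    """Return a Counter of alert-worthy SQLSTATE codes found in the given lines."""
    counts = Counter()
    for line in lines:
        match = SQLSTATE_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def should_alert(counts, threshold=1):
    """Alert once the total reaches the threshold (a single error by default)."""
    return sum(counts.values()) >= threshold


# Made-up sample lines, for illustration only (hypothetical prefix '%m [%p] %e ').
sample_lines = [
    "2023-07-01 10:00:00.123 UTC [4242] 00000 LOG:  checkpoint starting: time",
    "2023-07-01 10:00:01.456 UTC [4243] XX001 ERROR:  invalid page in block 7 of relation base/16384/16385",
    '2023-07-01 10:00:02.789 UTC [4244] XX002 ERROR:  index "hypothetical_idx" contains unexpected zero page at block 3',
]

if __name__ == "__main__":
    counts = count_corruption_errors(sample_lines)
    for code, n in sorted(counts.items()):
        print(f"{code} ({ALERT_CODES[code]}): {n}")
    print("alert:", should_alert(counts))
```

The default threshold of 1 reflects the point above that even a single such error deserves analysis. In a real setup the same match would more likely live in the log pipeline or alerting rules than in a standalone script.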