Improve monitoring for Postgres transaction ID wraparound

Transaction ID wraparound is a very unpleasant event (e.g.: https://blog.sentry.io/2015/07/23/transaction-id-wraparound-in-postgres). We currently monitor for it using a tool, and generate a weekly report: https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/9352#postgres-checkup_F002.

We should replace this with a Prometheus alerting rule that monitors for txid space exhaustion, presumably using a percentage exhaustion threshold, predict_linear, or both. That way, we keep the simplicity of using Prometheus+Alertmanager as our alerting platform, with no out-of-band tooling to remember.
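
To make the two candidate conditions concrete, here is a rough Python sketch of the arithmetic such a rule would perform. It is not an actual Prometheus rule. Everything in it is an assumption for illustration: the samples are taken to be `(unix_seconds, xid_age)` pairs, where `xid_age` would be something like the maximum `age(datfrozenxid)` across `pg_database`; the limit is taken as roughly 2^31; and the 50% threshold and 7-day horizon are placeholder values. The `predict_linear` function here is a plain least-squares extrapolation from the last sample, which only approximates what the PromQL function of the same name does.

```python
from typing import Sequence, Tuple

# Assumption: usable XID age before wraparound trouble is roughly 2^31.
XID_LIMIT = 2**31

Sample = Tuple[float, float]  # (unix timestamp in seconds, xid age)


def pct_used(xid_age: float) -> float:
    """Percentage of the assumed XID space already consumed."""
    return 100.0 * xid_age / XID_LIMIT


def predict_linear(samples: Sequence[Sample], horizon_s: float) -> float:
    """Least-squares line through the samples, evaluated horizon_s
    seconds after the last sample."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    var = sum((t - mean_t) ** 2 for t, _ in samples)
    if var == 0:
        raise ValueError("samples must span more than one timestamp")
    cov = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    slope = cov / var
    intercept = mean_v - slope * mean_t
    return intercept + slope * (samples[-1][0] + horizon_s)


def should_alert(
    samples: Sequence[Sample],
    pct_threshold: float = 50.0,        # placeholder threshold
    horizon_s: float = 7 * 24 * 3600,   # placeholder: look 7 days ahead
) -> bool:
    """Fire if we are already past the threshold, or if the current
    trend would reach the limit within the horizon."""
    current_age = samples[-1][1]
    over_threshold = pct_used(current_age) >= pct_threshold
    will_exhaust = predict_linear(samples, horizon_s) >= XID_LIMIT
    return over_threshold or will_exhaust
```

Combining both conditions covers the two failure shapes: a slow creep that has already used a large share of the space, and a sudden burn rate that would exhaust it before the next weekly report.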

This may or may not require adding a feature to the postgres exporter, depending on whether this information is already exposed.

See also https://gitlab.slack.com/archives/C3NBYFJ6N/p1587082809209300, which got brought up in the Postgres training.

cc @abrandl @NikolayS @Finotto @bjk-gitlab

@AnthonySandoval for triage help - is this ~"team::Observability" or Database?