add alerts to detect for query optimiser issues in postgres (!3317)

Adds an alert to detect the type of incident we saw in gitlab-com/gl-infra/production#2885 (closed) and gitlab-com/gl-infra/production#3875 (closed).

Also adds a dashboard with visualizations that may help in diagnosing the problem. The alert will include a link to the dashboard, with template variables set to the appropriate values.

The logic for this alert is as follows:

If more than 50% of tuple fetches on the primary instance are for a single table for more than 3 minutes, alert.
If more than 50% of tuple fetches on all the replica instances, aggregated together, are for a single table for more than 5 minutes, alert.
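
The two conditions above can be sketched in Python to make the thresholds concrete. This is only an illustration, not the alert rule added by this merge request: the function names, the per-table dictionaries of tuple fetch rates, and the fixed sampling interval are all assumptions made for the example.

```python
from typing import Dict, List, Optional

TableRates = Dict[str, float]  # hypothetical input: table name -> tuple fetch rate


def aggregate(instances: List[TableRates]) -> TableRates:
    """Sum per-table fetch rates across instances (used for the replica case)."""
    totals: TableRates = {}
    for rates in instances:
        for table, rate in rates.items():
            totals[table] = totals.get(table, 0.0) + rate
    return totals


def dominant_table(rates: TableRates, threshold: float = 0.5) -> Optional[str]:
    """Return the table whose share of all tuple fetches exceeds threshold, if any."""
    total = sum(rates.values())
    if total <= 0:
        return None
    for table, rate in rates.items():
        if rate / total > threshold:
            return table
    return None


def should_alert(samples: List[TableRates], interval_s: int, for_s: int,
                 threshold: float = 0.5) -> bool:
    """True if one table dominated every sample spanning the last for_s seconds."""
    # Samples taken interval_s apart need for_s // interval_s + 1 points
    # to cover a window of for_s seconds.
    needed = for_s // interval_s + 1
    if len(samples) < needed:
        return False
    winners = {dominant_table(s, threshold) for s in samples[-needed:]}
    return len(winners) == 1 and None not in winners


# Primary: one instance, condition must hold for 3 minutes.
primary = [{"namespaces": 900.0, "projects": 100.0}] * 4
print(should_alert(primary, interval_s=60, for_s=180))

# Replicas: aggregate all instances first, condition must hold for 5 minutes.
replica_sample = aggregate([
    {"namespaces": 300.0, "projects": 10.0},
    {"namespaces": 10.0, "projects": 90.0},
])
print(should_alert([replica_sample] * 6, interval_s=60, for_s=300))
```

Note that the replica case sums fetches across instances before computing the share, so a single replica that looks healthy on its own does not mask a table dominating the replica fleet as a whole.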

https://dashboards.gitlab.net/d/alerts-postgres_user_table_alerts/alerts-postgres-user-table-alerts?orgId=1&var-PROMETHEUS_DS=Global&var-environment=gprd&var-fqdn=patroni-03-db-gprd.c.gitlab-production.internal&var-relname=namespaces

Edited Mar 09, 2021 by Andrew Newdigate
