Add alerts to detect query optimiser issues in Postgres

Adds an alert to detect the type of incident we saw in gitlab-com/gl-infra/production#2885 (closed) and gitlab-com/gl-infra/production#3875 (closed).

Also adds a dashboard with visualizations that may help in diagnosing the problem. The alert will include a link to the dashboard, with template vars set to the appropriate values.

The logic for this alert is as follows:

  1. If more than 50% of tuple fetches on the primary instance are for a single table for more than 3 minutes, alert.
  2. If more than 50% of tuple fetches on all the replica instances aggregated together are for a single table for more than 5 minutes, alert.
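The two rules above can be sketched in Python to make the thresholds concrete. This is only an illustration, not the alert rule added in this MR: the function names, the per-table fetch-rate dictionaries, and the assumption of one sample per minute are all hypothetical.

```python
from typing import Dict, List, Optional

# Hypothetical input shape: one dict per sample, mapping table name to
# tuple-fetch rate. One sample per minute is assumed for illustration.
Sample = Dict[str, float]


def table_share(sample: Sample, table: str) -> float:
    """Fraction of all tuple fetches in this sample that hit `table`."""
    total = sum(sample.values())
    if total <= 0:
        return 0.0
    return sample.get(table, 0.0) / total


def aggregate_replicas(per_replica: List[Sample]) -> Sample:
    """Sum fetch rates across replicas, as rule 2 treats them as one pool."""
    combined: Sample = {}
    for rates in per_replica:
        for table, rate in rates.items():
            combined[table] = combined.get(table, 0.0) + rate
    return combined


def table_to_alert_on(
    samples: List[Sample], threshold: float = 0.5, hold_samples: int = 3
) -> Optional[str]:
    """Return a table whose share exceeded `threshold` in every one of the
    last `hold_samples` samples, or None if there is no such table."""
    if len(samples) < hold_samples:
        return None
    window = samples[-hold_samples:]
    for table in window[-1]:
        if all(table_share(s, table) > threshold for s in window):
            return table
    return None
```

Under these assumptions, the primary would be checked with `hold_samples=3`, and the replicas with `hold_samples=5` after passing each minute's per-replica rates through `aggregate_replicas`.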

https://dashboards.gitlab.net/d/alerts-postgres_user_table_alerts/alerts-postgres-user-table-alerts?orgId=1&var-PROMETHEUS_DS=Global&var-environment=gprd&var-fqdn=patroni-03-db-gprd.c.gitlab-production.internal&var-relname=namespaces

Edited by Andrew Newdigate
