Create default Prometheus alert rules for Geo metrics
Description
When a secondary node fails to sync for any reason, we should warn the admin that the sync has failed, rather than hope they discover the problem by chance.
Proposal
Some number of sync failures is normal, so as a starting point we should perhaps alert only if more than 2% of total replicables are failing to sync.
This alert rule probably belongs in https://gitlab.com/gitlab-org/omnibus-gitlab/-/blob/master/files/gitlab-cookbooks/monitoring/templates/rules/gitlab.rules
I'm not sure if it's a concern, but this rule must not cause problems in other configurations, for example when Geo is not enabled.
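To make the 2% threshold concrete, here is a rough sketch of what such a rule might look like in Prometheus rule-file syntax. It is not the rule this issue proposes. The metric names geo_lfs_objects_failed and geo_lfs_objects are assumed from the GitLab metrics list and should be checked there, and the group name, alert name, 30m window, and severity label are hypothetical. It also covers a single replicable type rather than all replicables combined.

```yaml
# Sketch only: metric names, alert name, and the 30m window are assumptions.
groups:
  - name: geo-sync.rules  # hypothetical group name
    rules:
      - alert: GeoLfsObjectsSyncFailureRateHigh  # hypothetical alert name
        # Ratio of failed to total LFS objects, compared against 2%.
        # The "and ... > 0" clause drops series whose total is zero.
        expr: |
          (geo_lfs_objects_failed / geo_lfs_objects) > 0.02
            and geo_lfs_objects > 0
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "More than 2% of LFS objects are failing to sync to a Geo secondary"
          description: "Current failure ratio: {{ $value | humanizePercentage }}."
```

On an instance where Geo is not enabled and these metrics are never exported, the expression returns no series, so an alert written this way should simply never fire rather than raise an error.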
References
- https://docs.gitlab.com/ee/administration/monitoring/prometheus/
- https://prometheus.io/docs/alerting/latest/configuration/
Implementation Plan
Short term - Documentation Update
- Document some useful Prometheus rules that can be used for Geo. A good starting point would be to list those currently used on GitLab.com/Dedicated as examples: https://gitlab.com/gitlab-com/gl-infra/gitlab-dedicated/tenant-observability-config/-/blob/main/metrics-catalog/mixins/geo/rules/rules-geo.json. We can add the new documentation section for Geo Prometheus rules under https://docs.gitlab.com/administration/geo/
- It might also be helpful to link to the available metrics (all prefixed with geo_) in this list: https://docs.gitlab.com/administration/monitoring/prometheus/gitlab_metrics/
Long term - Runbook Update (Separate issue here - #526025)
- Update our runbooks to include the rules above by default when deploying a GitLab instance as per the discussion here - #1816 (comment 2394064221)