Create default Prometheus alert rules for Geo metrics


Description

When a secondary node fails to sync for any reason, we should warn the admin that the sync has failed, rather than hope they discover the problem by chance.

Proposal

Some sync failures are normal, so as a starting point we should alert only if more than 2% of total replicables are failing to sync.

This alert rule probably belongs in https://gitlab.com/gitlab-org/omnibus-gitlab/-/blob/master/files/gitlab-cookbooks/monitoring/templates/rules/gitlab.rules

I'm not sure whether it's a concern, but this rule must not cause problems when Geo is not enabled.
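
To make the proposed threshold concrete, here is a minimal Python sketch of the decision logic, not the rule itself (the real rule would be a PromQL expression in gitlab.rules). The function name `should_alert` and the `failed`/`total` inputs are hypothetical; they stand in for whatever Geo metrics the rule ends up reading. Missing values model the case where Geo is not enabled and no Geo metrics are exported.

```python
from typing import Optional

# The 2% starting point suggested in the proposal; an assumption to be tuned.
FAILURE_RATIO_THRESHOLD = 0.02


def should_alert(failed: Optional[int], total: Optional[int],
                 threshold: float = FAILURE_RATIO_THRESHOLD) -> bool:
    """Return True when the share of failed replicables exceeds the threshold."""
    # No data (for example, Geo is not enabled) or nothing to replicate:
    # stay silent rather than raise a spurious alert.
    if failed is None or total is None or total <= 0:
        return False
    return failed / total > threshold


if __name__ == "__main__":
    print(should_alert(failed=30, total=1000))     # 3% of replicables failing
    print(should_alert(failed=10, total=1000))     # 1% of replicables failing
    print(should_alert(failed=None, total=None))   # Geo not enabled, no metrics
```

Whatever form the final rule takes, it needs the same two guards: an absent metric and a zero total should both produce no alert.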

References

Implementation Plan

Short term - Documentation Update

  1. Document some useful Prometheus rules for Geo. A good starting point is to list, as examples, the rules currently used on GitLab.com/Dedicated: https://gitlab.com/gitlab-com/gl-infra/gitlab-dedicated/tenant-observability-config/-/blob/main/metrics-catalog/mixins/geo/rules/rules-geo.json. The new Geo Prometheus Rules documentation section can be added under https://docs.gitlab.com/administration/geo/
  2. It might also be helpful to link to the available metrics (all prefixed with geo_) in this list: https://docs.gitlab.com/administration/monitoring/prometheus/gitlab_metrics/

Long term - Runbook Update (Separate issue here - #526025)

  1. Update our runbooks to include the rules above by default when deploying a GitLab instance as per the discussion here - #1816 (comment 2394064221)