Create row count metric and alert for stalled (pre)imports during registry migration phase 2
Problem
(Pre)import completion notifications from the registry to Rails can go missing, in which case Rails is left in the dark about those completions. This is especially critical for missing final import completion notifications, as a repository is kept in read-only mode during a final import.
We need a way to monitor this specific event and alert us if the count of "stalled" (at least from the Rails perspective) migrations goes above a given threshold and/or stays above it for longer than a given amount of time.
Solution
Prometheus
Create two new row count metrics on top of the Rails container_repositories database table, one for counting potentially stalled pre-imports and another for final imports. This requires a change in gitlab-exporter and chef-repo, similar to what was done in #356286 (closed), which will take care of generating the respective Prometheus metrics.
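As a rough illustration of what each row count could measure, here is a minimal Python sketch over in-memory rows. It is not the gitlab-exporter implementation. The field names (`migration_state`, `started_at`), the state values and the 30-minute cutoff are all assumptions made for this example; the real query would be defined in the gitlab-exporter/chef-repo change.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable, Optional


@dataclass
class RepoRow:
    # Hypothetical subset of container_repositories columns (assumed names).
    migration_state: str
    started_at: Optional[datetime]


def count_stalled(rows: Iterable[RepoRow], state: str,
                  now: datetime, max_age: timedelta) -> int:
    """Count rows still in `state` whose (pre)import started before now - max_age."""
    cutoff = now - max_age
    return sum(
        1
        for row in rows
        if row.migration_state == state
        and row.started_at is not None
        and row.started_at < cutoff
    )


if __name__ == "__main__":
    now = datetime(2022, 1, 1, 12, 0)
    rows = [
        RepoRow("pre_importing", now - timedelta(hours=2)),   # stalled pre-import
        RepoRow("pre_importing", now - timedelta(minutes=5)),  # still fresh
        RepoRow("importing", now - timedelta(hours=1)),        # stalled final import
    ]
    # Assumed cutoff of 30 minutes; one metric per migration state.
    print(count_stalled(rows, "pre_importing", now, timedelta(minutes=30)))
    print(count_stalled(rows, "importing", now, timedelta(minutes=30)))
```

Keeping one count per state mirrors the two separate metrics proposed above, so pre-imports and final imports can be graphed and alerted on independently.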
Grafana
We should then add a graph for the resulting metrics to the registry migration dashboard in Grafana (similar to gitlab-com/runbooks!4482 (merged)).
Alerts (Slack and severity4 incident)
Finally, we should create an alert in the runbooks project so that if either count goes above N for longer than Y we'll get a notification on Slack (#alerts and #feed_alerts-general channels) and an S4 incident is automatically created for investigation.
See https://gitlab.com/gitlab-com/runbooks/-/blob/master/docs/monitoring/alerts_manual.md for instructions.
Here is a very similar example: gitlab-com/runbooks!3966 (diffs).
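The "above N for longer than Y" condition can be sketched in Python to make the intended semantics concrete. This is only an illustration of the logic, not the runbooks alert definition. The function name `first_firing_time` and the sample values are hypothetical, and N and Y remain to be chosen.

```python
from datetime import datetime, timedelta
from typing import Iterable, Optional, Tuple


def first_firing_time(samples: Iterable[Tuple[datetime, float]],
                      threshold: float,
                      hold_for: timedelta) -> Optional[datetime]:
    """Return the first sample time at which the value has stayed above
    `threshold` (N) continuously for at least `hold_for` (Y), else None."""
    breach_started: Optional[datetime] = None
    for ts, value in samples:
        if value > threshold:
            if breach_started is None:
                breach_started = ts  # start of the current breach
            if ts - breach_started >= hold_for:
                return ts
        else:
            breach_started = None  # any dip at or below N resets the timer
    return None


if __name__ == "__main__":
    base = datetime(2022, 1, 1, 12, 0)
    values = [0, 5, 5, 0, 5, 5, 5, 5]  # one sample per minute (made-up data)
    samples = [(base + timedelta(minutes=i), v) for i, v in enumerate(values)]
    print(first_firing_time(samples, threshold=3, hold_for=timedelta(minutes=2)))
```

Requiring the count to stay above N for the whole of Y avoids alerting on short spikes, for example migrations that are slow but do complete.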