Container Registry: Port row count queries from repository cleanup worker to gitlab-exporter
🔥 Problem
In !101918 (merged), we introduced a container registry cleanup worker that kicks off container repository deletion when necessary.
In addition to that, we use that job to compute metrics around container registry cleanups (source):
- Number of container repositories waiting to be destroyed.
- Number of container repository deletes that are stale (started more than 30 minutes ago).
Both queries can produce statements that scan a large number of rows, so they can eventually perform poorly and affect service reliability.
Additionally, these are important performance/reliability indicators, so we should create alerts based on them, which we can't do through logs/Kibana.
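To make the two row counts concrete, here is a minimal sketch of what they compute, run against an in-memory SQLite table. The table layout, the status values (delete_scheduled, delete_ongoing), the epoch-seconds timestamp column, and the function names are simplifying assumptions for illustration. The real queries run against PostgreSQL and may differ.

```python
import sqlite3
from datetime import datetime, timedelta, timezone

# Assumed, simplified schema. The real table has more columns.
SCHEMA = """
CREATE TABLE container_repositories (
    id INTEGER PRIMARY KEY,
    status TEXT,                -- assumed values: NULL, 'delete_scheduled', 'delete_ongoing'
    delete_started_at INTEGER   -- Unix epoch seconds (a simplification), NULL until a delete starts
)
"""

STALE_AFTER = timedelta(minutes=30)  # staleness window stated in the issue


def pending_destruction_count(conn):
    """Count repositories waiting to be destroyed (assumed status value)."""
    row = conn.execute(
        "SELECT COUNT(*) FROM container_repositories WHERE status = 'delete_scheduled'"
    ).fetchone()
    return row[0]


def stale_delete_count(conn, now):
    """Count ongoing deletes that started more than STALE_AFTER before `now`."""
    cutoff = int((now - STALE_AFTER).timestamp())
    row = conn.execute(
        "SELECT COUNT(*) FROM container_repositories "
        "WHERE status = 'delete_ongoing' AND delete_started_at < ?",
        (cutoff,),
    ).fetchone()
    return row[0]


def demo():
    """Build a tiny fixture and return both counts as a tuple."""
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)

    def minutes_ago(minutes):
        return int((now - timedelta(minutes=minutes)).timestamp())

    rows = [
        (None, None),                         # not scheduled for deletion
        ("delete_scheduled", None),           # waiting to be destroyed
        ("delete_scheduled", None),           # waiting to be destroyed
        ("delete_ongoing", minutes_ago(5)),   # recent delete, not stale
        ("delete_ongoing", minutes_ago(45)),  # stale delete
    ]
    conn.executemany(
        "INSERT INTO container_repositories (status, delete_started_at) VALUES (?, ?)",
        rows,
    )
    return pending_destruction_count(conn), stale_delete_count(conn, now)
```

Both functions are plain COUNT(*) statements, which is why their cost grows with the number of matching rows and why running them from a dedicated exporter probe rather than from the worker is attractive.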
Implementation Guide
Port the mentioned row count queries into gitlab-exporter and create alerts for them.
Steps
- Add row count queries to gitlab-exporter (example)
- Enable the new query probes in chef-repo for gstg only (example)
- Create a new Grafana dashboard named registry: Rails detail. This new dashboard will be used to aggregate all Rails-related metrics for container registry features (example)
  - Add a new Repository Cleanup row to the dashboard.
  - Within the new row, add a timeseries graph for each of the row count metrics, pointing them to the corresponding Prometheus metric. You can visualize metrics by pointing to the gstg environment, where the new query probes are already emitting them.
- Observe metrics in gprd for a few days. The intent is to gauge the usual/normal values for each of them before we create alerts that should be triggered when those values are crossed.
- Create one custom alert for each metric. This should fire when counts stay above the desired/expected thresholds for an extended time (e.g. 1 hour). Redirect alerts to g_container-registry_alerts. (example)
- Remove row counts from ContainerRegistry::CleanupWorker in Rails.
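The alert condition in the steps above ("counts stay above the threshold for an extended time") can be illustrated with a small sketch. It is not the actual alert rule, only a model of the intended behaviour: the alert fires when every sample in the most recent stretch exceeds the threshold and that stretch spans at least the hold duration. The function name should_fire and its parameters are hypothetical, and the real thresholds are still to be chosen after observing gprd.

```python
from datetime import datetime, timedelta


def should_fire(samples, threshold, hold_for):
    """Return True if the metric has stayed above `threshold` for at least `hold_for`.

    `samples` is a list of (timestamp, value) pairs sorted by ascending time.
    Any sample at or below the threshold resets the breach window.
    """
    breach_start = None
    for timestamp, value in samples:
        if value > threshold:
            if breach_start is None:
                breach_start = timestamp  # first sample of the current breach
        else:
            breach_start = None  # recovered, so start over
    if breach_start is None:
        return False
    # Compare the length of the current breach against the required duration.
    return samples[-1][0] - breach_start >= hold_for
```

For example, with one sample every 5 minutes, a count that stays above the threshold for 90 minutes fires a 1 hour alert, while a single dip below the threshold 30 minutes ago does not, because the breach window restarts after the dip.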