Server-side backup metrics
Add Prometheus metrics to keep track of server-side backups.
Proposed metrics
The table below outlines some metrics we can consider adding. These were tested locally via the GDK. For each metric, we can also track gl_project_path as a label attribute to identify particularly large or troublesome repositories.
Metric | Example | Notes |
---|---|---|
Backup duration by phase | | A rolling average of the time spent in each of the four phases of a backup. |
RPC responses by code | | Rate of RPC responses grouped by response code. |
RPC response time | | A rolling average of the response time for the RPC, which closely approximates the actual time taken to back up a single repository. |
Bundle upload rate | | Upload rate in MB/s of bundle files into object storage. |
Bundle uploads by size | | Persistent count of bundles uploaded by size. Each row represents the number of bundles uploaded with a size within that bucket, e.g. 186 bundles <10MB were uploaded. Not sure how useful this graph will be in practice. |
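
To make the "bundle uploads by size" idea concrete, here is a minimal, dependency-free sketch of how upload sizes could be tallied into per-bucket counts. The bucket boundaries, labels, and function names are assumptions for illustration; only the `<10MB` bound comes from the table, and the real metric may use different buckets.

```go
package main

import "fmt"

const mb = 1 << 20

// sizeBucket is one row of a hypothetical "bundle uploads by size" graph.
type sizeBucket struct {
	label      string
	upperBytes int64 // exclusive upper bound; 0 means "no upper bound"
}

// buckets are illustrative only; just the "<10MB" bound appears in the proposal.
var buckets = []sizeBucket{
	{"<10MB", 10 * mb},
	{"10MB-100MB", 100 * mb},
	{"100MB-1GB", 1024 * mb},
	{">=1GB", 0},
}

// bucketLabel returns the label of the first bucket whose bound exceeds size.
func bucketLabel(size int64) string {
	for _, b := range buckets {
		if b.upperBytes == 0 || size < b.upperBytes {
			return b.label
		}
	}
	return buckets[len(buckets)-1].label
}

// countBySize tallies bundle sizes (in bytes) into per-bucket counts.
func countBySize(sizes []int64) map[string]int {
	counts := make(map[string]int)
	for _, s := range sizes {
		counts[bucketLabel(s)]++
	}
	return counts
}

func main() {
	// Made-up bundle sizes, purely to exercise the bucketing.
	sizes := []int64{2 * mb, 9 * mb, 50 * mb, 2048 * mb}
	counts := countBySize(sizes)
	for _, b := range buckets {
		fmt.Printf("%s: %d\n", b.label, counts[b.label])
	}
}
```

Note that a Prometheus histogram exposes cumulative `le` buckets rather than the disjoint ranges used here, so a dashboard showing per-bucket rows would have to derive them from the cumulative series.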
Implementation plan
- Add Prometheus observers to the backup package to collect metrics during a backup run. MR: !6540 (merged)
- Once deployed, perform test queries against Canary and Production using Thanos.
- Add new graphs to the main and host detail dashboards. The graphs can be defined in a separate libsonnet file (similar to adaptive limits) and shared. MR: gitlab-com/runbooks!6610 (merged)
- Add a new section to the monitoring documentation covering the new backup metrics. MR: gitlab!138883 (merged)
Edited by James Liu