Server-side backup metrics

Add Prometheus metrics to keep track of server-side backups.

Proposed metrics

The table below outlines metrics we can consider adding. These were tested locally via the GDK. For each metric, we can also record `gl_project_path` as a label to identify particularly large or troublesome repositories.

Metric	Example	Notes
Backup duration by phase		Rolling average duration of each phase of a backup. Backups have four phases: writing refs, writing the bundle, writing custom hooks, and committing the manifest.
`BackupRepository` RPC response codes		Rate of RPC responses grouped by response code. `BackupRepository` emits the following codes: OK, NotFound (for skipped backups), and Internal (for errors).
`BackupRepository` RPC response time		Rolling average response time for the RPC, which approximates the actual time taken to back up a single repository.
Bundle upload rate		Upload rate in MB/s of bundle files into object storage.
Bundle uploads by size		Persistent count of bundles uploaded, grouped by size bucket. Each row represents the number of bundles uploaded with a size within that bucket, e.g. 186 bundles <10MB were uploaded. It is unclear how useful this graph will be in practice.
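
To make the "bundle uploads by size" row concrete, here is a minimal sketch of how a Prometheus-style histogram counts observations per size bucket. It is plain Python for illustration only and is not the Gitaly implementation. The bucket boundaries and the `bucket_counts` helper are assumptions; the issue does not state the real boundaries. Note that Prometheus histogram buckets are cumulative (`le`, "less than or equal"), so a dashboard showing per-bucket counts would derive them from differences between adjacent buckets.

```python
from bisect import bisect_left

MB = 1024 * 1024

# Hypothetical bucket upper bounds in bytes; the real boundaries are not given in the issue.
BUCKET_BOUNDS = [10 * MB, 100 * MB, 1024 * MB]


def bucket_counts(sizes, bounds=BUCKET_BOUNDS):
    """Return cumulative counts per bucket, in the style of a Prometheus histogram."""
    per_bucket = [0] * (len(bounds) + 1)  # the final slot is the +Inf bucket
    for size in sizes:
        # bisect_left finds the first bound >= size, i.e. the smallest "le" bucket that fits.
        per_bucket[bisect_left(bounds, size)] += 1

    cumulative = []
    running = 0
    for count in per_bucket:
        running += count
        cumulative.append(running)

    labels = [f"le={bound // MB}MB" for bound in bounds] + ["le=+Inf"]
    return dict(zip(labels, cumulative))


if __name__ == "__main__":
    uploads = [5 * MB, 10 * MB, 50 * MB, 2048 * MB]
    for label, count in bucket_counts(uploads).items():
        print(label, count)
```

Choosing a small, fixed set of boundaries keeps the number of time series low, which matters more if `gl_project_path` is also attached as a label.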

Implementation plan

Add Prometheus observers to the backup package to collect metrics during a backup run. MR: !6540 (merged)
Once deployed, perform test queries against Canary and Production using Thanos.
Add new graphs to the main and host detail dashboards. The graphs can be defined in a separate libsonnet file (similar to adaptive limits) and shared. MR: gitlab-com/runbooks!6610 (merged)
Add a new section to the monitoring documentation covering the new backup metrics. MR: gitlab!138883 (merged)
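
The observers in the first step can be pictured with a small sketch. This is an illustrative Python stand-in, not the code from the merged MR (Gitaly is written in Go): the `PhaseTimer` class, its method names, and the phase label `"bundle"` are all hypothetical. It records a sum and a count of durations per phase, which is the pair of series from which an average duration per phase can be computed at query time.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class PhaseTimer:
    """Hypothetical observer that accumulates duration sum and count per backup phase."""

    def __init__(self, clock=time.perf_counter):
        self._clock = clock  # injectable so the timer can be tested deterministically
        self.total_seconds = defaultdict(float)
        self.count = defaultdict(int)

    @contextmanager
    def observe(self, phase):
        start = self._clock()
        try:
            yield
        finally:
            # Record the observation even if the phase raised an error.
            self.total_seconds[phase] += self._clock() - start
            self.count[phase] += 1

    def mean(self, phase):
        """Average duration of a phase in seconds, or None if it was never observed."""
        if self.count[phase] == 0:
            return None
        return self.total_seconds[phase] / self.count[phase]


if __name__ == "__main__":
    timer = PhaseTimer()
    with timer.observe("bundle"):  # hypothetical phase label
        time.sleep(0.01)  # stand-in for writing the bundle
    print(timer.count["bundle"], timer.mean("bundle"))
```

Wrapping each of the four phases in its own `observe` call keeps the timing logic out of the backup code path itself.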

Edited Dec 06, 2023 by James Liu
