Rails metrics cardinality review

We should carry out a review of the metrics GitLab-Rails emits and consider whether there are optimisations we can make.

The gitlab-rails job on the web service emits by far the most application metrics: each instance returns 40k metrics on every scrape.

Of these, the following four histogram metrics dominate, together accounting for about 60% of the total:

  • gitlab_sql_duration_seconds_bucket
  • gitlab_transaction_duration_seconds_bucket
  • gitlab_transaction_allocated_memory_bytes_bucket
  • gitlab_transaction_cputime_seconds_bucket
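
All four are histogram `_bucket` series, and Prometheus exposes one such series per bucket boundary (plus `+Inf`) for every label combination. As a rough illustration of how quickly this multiplies, here is a minimal sketch; the label names and counts are hypothetical and not taken from the actual GitLab-Rails instrumentation:

```python
from typing import Mapping


def histogram_bucket_series(finite_buckets: int,
                            label_value_counts: Mapping[str, int]) -> int:
    """Upper bound on the `_bucket` series one histogram can emit.

    Assumes every combination of label values is actually observed,
    so real counts are usually lower.
    """
    combinations = 1
    for distinct_values in label_value_counts.values():
        combinations *= distinct_values
    # One series per finite bucket boundary, plus the implicit +Inf bucket.
    return combinations * (finite_buckets + 1)


# Hypothetical example: 10 finite buckets, 50 controllers x 8 actions.
estimate = histogram_bucket_series(10, {"controller": 50, "action": 8})
print(estimate)
```

Under these assumed numbers a single histogram would already contribute several thousand series per instance, so trimming either the bucket list or the label set has a multiplicative effect.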

https://prometheus-app.gprd.gitlab.net/graph?g0.range_input=1h&g0.expr=topk(20%2C%20%0Acount(%7Benvironment%3D%22gprd%22%2Cfqdn%3D%22web-04-sv-gprd.c.gitlab-production.internal%22%2Cinstance%3D%22web-04-sv-gprd.c.gitlab-production.internal%3A8083%22%2Cjob%3D%22gitlab-rails%22%2Cshard%3D%22default%22%2Cstage%3D%22main%22%2Ctier%3D%22sv%22%2Ctype%3D%22web%22%7D)%20by%20(name))&g0.tab=1
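
The query linked above counts series per metric name on one instance. The same breakdown can be approximated offline from a saved scrape. The following is a minimal sketch, assuming the scrape is in the Prometheus text exposition format; the function names and the sample payload (including its labels and values) are hypothetical:

```python
from collections import Counter
from typing import List, Tuple


def count_series_by_name(exposition_text: str) -> Counter:
    """Count sample lines per metric name in a text-format scrape."""
    counts: Counter = Counter()
    for raw_line in exposition_text.splitlines():
        line = raw_line.strip()
        # Skip blank lines and # HELP / # TYPE comments.
        if not line or line.startswith("#"):
            continue
        # The metric name ends at the first '{' or whitespace character.
        name_end = len(line)
        for index, char in enumerate(line):
            if char == "{" or char.isspace():
                name_end = index
                break
        counts[line[:name_end]] += 1
    return counts


def top_metrics(exposition_text: str, k: int = 20) -> List[Tuple[str, int]]:
    """Return the k metric names with the most series, largest first."""
    return count_series_by_name(exposition_text).most_common(k)


# Hypothetical sample payload, for illustration only.
SAMPLE = """\
# TYPE gitlab_sql_duration_seconds histogram
gitlab_sql_duration_seconds_bucket{le="0.1"} 3
gitlab_sql_duration_seconds_bucket{le="+Inf"} 5
gitlab_sql_duration_seconds_count 5
"""

for name, series in top_metrics(SAMPLE, k=5):
    print(name, series)
```

Running this against the output of a single instance's metrics endpoint should give a ranking comparable to the query, without putting load on Prometheus.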

Reducing the volume of metrics would benefit GitLab in several ways:

  1. Less compute time spent in instrumentation
  2. Fewer complications from GIL lockups and related issues
  3. A more stable observability platform