
Add Prometheus recording rules for Usage Ping

Matthias Käppler requested to merge mk/usage-ping-recording-rules into master

In &3209 we are looking to collect data from self-managed Omnibus installations about how they are deployed and scaled, how much memory is available and how much of it is used, and so on.

We settled on the bundled Prometheus as the best source for this, since most of our components already export useful metrics to it.

For 13.1 we are currently sending queries to the bundled Prometheus ad-hoc as part of the Usage Ping worker job: https://gitlab.com/gitlab-org/gitlab/-/blob/master/lib/gitlab/usage_data_concerns/topology.rb

We now want to move these queries into recording rules that run on all bundled Prometheus installations.

This has several advantages:

  • Better separation of concerns. We can treat recorded metrics as a sort of contract between the application as the client and Prometheus as the server. We can change the sometimes complex queries without having to change the client code, which keeps application logic apart from data/storage logic.
  • It makes it easier to unit test queries. The best we can currently do on the client side is to use string matchers on Prometheus queries, which for complex queries tightly couples the test to the subject. Querying for recorded metrics is much simpler, which makes the interaction simpler to test as well.
  • It is more performant. The main argument for recording rules is that, since they are a sort of mini-ETL, the data is readily available to be queried instead of having to be aggregated on demand. This makes it more likely that a query completes quickly and successfully (for example, without timing out). We had observed some queries taking several seconds when run against GitLab.com data.
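To illustrate the first two points, here is a minimal Python sketch of what the client side could look like once it only reads a recorded metric. It is not the actual Usage Ping client (that one is Ruby and lives in topology.rb); the function names and the `http://localhost:9090` base URL are hypothetical. The `/api/v1/query` endpoint and the response shape are those of the Prometheus HTTP API, and the metric name is the one introduced in this merge request.

```python
from urllib.parse import urlencode

# Recorded metric name taken from this merge request.
RECORDED_METRIC = "gitlab_usage_ping:gitlab_workhorse_http_requests:hourly_rate1w"


def instant_query_url(base_url: str, query: str) -> str:
    """Build a URL for the Prometheus instant-query endpoint (/api/v1/query)."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': query})}"


def first_sample_value(response: dict):
    """Return the first sample's value as a float, or None if nothing usable came back.

    `response` is the already-decoded JSON body of an instant query.
    """
    if response.get("status") != "success":
        return None
    result = response.get("data", {}).get("result", [])
    if not result:
        return None
    # Instant-vector samples look like {"metric": {...}, "value": [timestamp, "string"]}.
    return float(result[0]["value"][1])


# Hypothetical base URL; adjust to wherever the bundled Prometheus listens.
url = instant_query_url("http://localhost:9090", RECORDED_METRIC)
```

Because the query is just a metric name, a test only needs to check that the client asks for that name and handles an empty or failed response; it no longer has to match a long PromQL expression as a string.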

Note that these rules only need to be evaluated relatively infrequently, since we send the Usage Ping just once a week.

North Star Metric

Additionally, I introduced gitlab_usage_ping:gitlab_workhorse_http_requests:hourly_rate1w, which we did not have before. It is meant to be the 7-day rolling average of hourly request volume going to our application servers, and it is used to compute our team NSM:

Requests / Hour (Rolling 7 day average) / ( (GB of RAM * 0.004237) + (Cores * 0.031611) )
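As a worked form of the formula above, the following Python sketch computes the NSM from three inputs. The two weights are copied from the formula; the function name, parameter names, and the guard against a zero denominator are assumptions made for illustration and are not part of this merge request.

```python
# Weights copied from the NSM formula in this merge request.
RAM_GB_WEIGHT = 0.004237
CORE_WEIGHT = 0.031611


def north_star_metric(requests_per_hour: float, ram_gb: float, cores: float) -> float:
    """Requests per hour divided by the weighted sum of RAM (in GB) and CPU cores."""
    denominator = ram_gb * RAM_GB_WEIGHT + cores * CORE_WEIGHT
    if denominator <= 0:
        # Assumed guard: a node reporting no RAM and no cores has no meaningful NSM.
        raise ValueError("ram_gb and cores must describe a non-empty machine")
    return requests_per_hour / denominator


# Hypothetical inputs: 1000 requests/hour on a node with 16 GB of RAM and 4 cores.
nsm = north_star_metric(requests_per_hour=1000, ram_gb=16, cores=4)
```

Here `requests_per_hour` would be the value of the recorded hourly_rate1w metric, while the RAM and core counts would come from the node-level metrics.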

Open questions

I wasn't sure where to add these, since we have gitlab.rules and node.rules. These metrics are a bit of both: application-related metrics and node-related metrics. I put them in gitlab.rules because I see them as being part of the application; after all, Usage Ping is a feature of sorts. I did, however, move them into their own group.

References

Relates to:

Edited by GitLab Release Tools Bot
