Discuss alternatives for measuring "global metrics" in the Rails application with Prometheus
When building the `gitlab_maintenance_mode` metric, we opted to expose it to Prometheus through the Sidekiq exporter (see !114981). We chose the Sidekiq exporter because the metric does not depend on any other context such as the request or a specific project/user; for all practical purposes, it is "global" to the application itself.
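As a rough sketch of that setup, assuming the plain `prometheus-client` gem rather than GitLab's own metrics wrappers; the worker name and the settings lookup are illustrative, not the actual implementation:

```ruby
# Minimal sketch of the current approach: a periodic job sets a global gauge
# in the process-local registry, and the Sidekiq exporter serves that registry
# to Prometheus on scrape. Worker name and settings call are illustrative.
require 'prometheus/client'

class MaintenanceModeMetricWorker
  GAUGE = Prometheus::Client.registry.gauge(
    :gitlab_maintenance_mode,
    docstring: 'Whether GitLab maintenance mode is enabled (1) or not (0)'
  )

  # Scheduled (for example via sidekiq-cron) to run periodically on a Sidekiq pod.
  def perform
    enabled = Gitlab::CurrentSettings.maintenance_mode # assumption: application setting lookup
    GAUGE.set(enabled ? 1 : 0)
  end
end
```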
@reprazent explained the shortcomings of that approach below:
The scrape interval is every 15 seconds, and we use the default `query.lookback_delta` of 5 minutes. This means that a value will disappear if it wasn't scraped in the last 5 minutes. One thing to consider with these kinds of gauges, which we allow to be set from one pod exclusively, is the following:
- 00:00: `pod_1` sets the gauge to 1. Available metrics: `gitlab_maintenance_mode{pod="pod_1"} 1`
- 00:03: `pod_2` sets the gauge to 0. Available metrics: `gitlab_maintenance_mode{pod="pod_1"} 1` and `gitlab_maintenance_mode{pod="pod_2"} 0`
- 00:06: scrape information: `gitlab_maintenance_mode{pod="pod_2"} 0`
So this means that for a brief moment in time, we don't know which of those values is correct. For a maintenance mode flag that changes very infrequently this might be fine: we could query around it using a `max` and take the worst-case report. But I don't think this approach can be a blanket solution for other kinds of state that we want to gather from the application into metrics.

A better approach could be to have this information scraped through an exporter (for example https://gitlab.com/gitlab-org/gitlab-exporter/) that reads the information directly from a datasource on scrape.
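As a rough illustration of that scrape-time idea (this is not gitlab-exporter's actual code; the server, port, and the `application_settings.maintenance_mode` query are assumptions), a stand-alone probe could read the value straight from the database on every scrape, so the reported value can never go stale the way a gauge cached in one Sidekiq pod can:

```ruby
# Illustrative exporter that answers each scrape by querying the datasource
# directly. Class name, port, and SQL are assumptions for the sketch.
require 'webrick'
require 'pg'

class MaintenanceModeExporter < WEBrick::HTTPServlet::AbstractServlet
  def do_GET(_request, response)
    conn = PG.connect(dbname: 'gitlabhq_production') # assumed connection details
    enabled = conn.exec('SELECT maintenance_mode FROM application_settings LIMIT 1')
                  .getvalue(0, 0) == 't'
    response['Content-Type'] = 'text/plain; version=0.0.4'
    response.body = <<~METRICS
      # HELP gitlab_maintenance_mode Whether maintenance mode is enabled
      # TYPE gitlab_maintenance_mode gauge
      gitlab_maintenance_mode #{enabled ? 1 : 0}
    METRICS
  ensure
    conn&.close
  end
end

server = WEBrick::HTTPServer.new(Port: 9168)
server.mount('/metrics', MaintenanceModeExporter)
server.start
```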
Or we could consider pushing the metrics into Prometheus through a push-gateway: https://prometheus.io/docs/practices/pushing/
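For comparison, the push-gateway variant could look roughly like the sketch below, assuming a recent version of the `prometheus-client` gem; the gateway URL and job name are placeholders:

```ruby
# Rough sketch of the push-gateway alternative: after changing the maintenance
# state, the application pushes the gauge to a Pushgateway, which Prometheus
# then scrapes. Gateway URL and job name are placeholders.
require 'prometheus/client'
require 'prometheus/client/push'

registry = Prometheus::Client::Registry.new
gauge = registry.gauge(
  :gitlab_maintenance_mode,
  docstring: 'Whether GitLab maintenance mode is enabled (1) or not (0)'
)
gauge.set(1)

Prometheus::Client::Push.new(
  job: 'gitlab-global-metrics',
  gateway: 'http://pushgateway.example.com:9091'
).add(registry)
```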
I wonder if running this job every minute is too frequent.

From the perspective of the lookback delta, I don't think we need to run this job every minute; anything less than 5 minutes should be fine.
Because more than one machine can run Sidekiq with the exporter turned on, when you switch the maintenance state you may briefly see both the previous and the new value in Prometheus, depending on which machine gathered the metric. This should clean up and fix itself as further by-the-minute executions happen, but for that brief period you may not be sure what the actual state is.
@reprazent clarifies the push-gateway suggestion below:
Just for clarification: push-gateway is a Prometheus-ism (https://prometheus.io/docs/practices/pushing/) and it's not generally recommended. `gitlab-exporter` is a GitLab component that exports GitLab-specific metrics and is scraped by Prometheus. It already queries the database and `redis-sidekiq`. So perhaps setting all of these kinds of metrics together in a hash in Redis, and then exporting that using `gitlab-exporter`, is a valid approach? I think that's slightly nicer than the push approach. I'm not sure dedicated instances are running `gitlab-exporter`, so I think the approach here is your best bet to get unblocked quickly.
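A minimal sketch of that Redis-hash idea, with made-up key and field names and a local Redis; the "exporter side" below only stands in for what a gitlab-exporter probe would do on scrape:

```ruby
# Sketch of the Redis-hash idea: gitlab-rails writes global values into one
# Redis hash, and the exporter turns that hash into gauges on every scrape.
# The hash key, field names, and Redis connection details are made up.
require 'redis'

GLOBAL_METRICS_KEY = 'gitlab:global_metrics'

# Rails side: record the current state whenever it changes (or periodically).
def record_global_metric(name, value, redis: Redis.new)
  redis.hset(GLOBAL_METRICS_KEY, name, value)
end

# Exporter side: read the whole hash on scrape and render it as gauges in the
# Prometheus text exposition format.
def render_global_metrics(redis: Redis.new)
  redis.hgetall(GLOBAL_METRICS_KEY).map do |name, value|
    "# TYPE #{name} gauge\n#{name} #{value}"
  end.join("\n")
end

record_global_metric('gitlab_maintenance_mode', 1)
puts render_global_metrics
```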
I've proposed we consider using `gitlab-exporter` as a push gateway as well, so that, in addition to the ad-hoc queries it already runs internally to extract PostgreSQL metrics, `gitlab-rails` would push to it whatever "global metric" it needs to expose.
Having `gitlab-exporter` retrieve that type of data instead would require specific knowledge of the implementation, which would create a hard coupling from `gitlab-exporter` to a specific version/commit of `gitlab-rails`; that is not ideal. So it seems the push-gateway approach is the best direction.
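Purely as an illustration of the direction (gitlab-exporter has no push endpoint today; the `/push` path, port, and payload format below are invented for the sketch), the `gitlab-rails` side of such a push might look like:

```ruby
# Hypothetical sketch of gitlab-rails pushing a "global metric" to
# gitlab-exporter acting as a push gateway. gitlab-exporter does not expose
# such an endpoint today; the URL, path, and payload format are invented.
require 'net/http'
require 'uri'

def push_global_metric(name, value, exporter_url: 'http://localhost:9168')
  uri = URI.join(exporter_url, '/push') # hypothetical ingestion endpoint
  body = "# TYPE #{name} gauge\n#{name} #{value}\n"
  Net::HTTP.post(uri, body, 'Content-Type' => 'text/plain; version=0.0.4')
end

push_global_metric('gitlab_maintenance_mode', 0)
```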