Skip to content

Zoekt metrics from MetricsUpdateCronWorker are inaccurate

Background

While looking at the search_zoekt_task_processing_queue_size prometheus metric in gitlab-com/runbooks!8560 (comment 2372752094) we discovered some surprising behaviour: old values of the metric appear to linger.

Analysis

For this query, we'd expect the value to be the same across all reporting pods, but it is wildly different:

sum by (pod) (search_zoekt_task_processing_queue_size{env="gprd",node_name="gitlab-gitlab-zoekt-0"})

image

The metric sticks to a value for minutes at a time, sometimes hours. Why does this happen?

This metric is implemented using sidekiq-cron with a job that is scheduled once per minute, running Search::Zoekt::MetricsUpdateCronWorker. There are two behaviours that contribute to this result:

  • Prometheus metrics persist at the pod level.
  • When you set a gauge, it will stay set to that value until the value is updated or the pod terminates.

This means that once per minute we run this worker, most of the time on a new pod. So now instead of a single metric showing the latest value, we have N metrics (N = active pods that have run at least one of these cron jobs) each reporting a point-in-time snapshot of the last value they saw.

Conclusion

MetricsUpdateCronWorker is not a sound pattern.

We could work around this a little bit by implementing some expiry logic that removes the gauge after some time (we should not reset to 0, as this would be incorrect). But it still would be quite a hack.

We could also consider pushgateway but that's also very hacky and has its own caveats, so probably not a great option.

Proposal

IMO this needs to be implemented as a proper exporter that fetches the latest value every time. In the case of zoekt, I would propose that the zoekt nodes export this metric on the server-side, and we remove the client-side measurement.

In the case of Elastic (where we read Gitlab::CurrentSettings.attributes), we may need to move this to gitlab-exporter or find some other solution. The Elastic case is less critical though, as those values change much less frequently.

Edited by Igor