Geo: Ensure only one MetricsUpdateWorker runs at a time

Michael Kozono requested to merge mk/extend-geo-metrics-lease-timeout into master

What does this MR do and why?

Extend the lease timeout of a job that can take longer than 5 minutes, to avoid compounding a performance problem.

The main downside I see is that if a metrics job is somehow lost, it can take up to an hour for a new one to start. ExclusiveLeaseGuard rescues StandardErrors and releases the lease in that case, so this should be rare.

Related to #370158 (comment 1056001661)

Diagnosing the problem that this addresses

If you suspect you have a long-running Geo::MetricsUpdateWorker, you can confirm this by outputting the duration of these completed jobs with:

  sudo grep 'Geo::MetricsUpdateWorker.*done:' /var/log/gitlab/sidekiq/current | jq '.duration_s'

Example partial snippet of output:

5.750802
7.153363
6.536177
6.284856
6.092997
6.343974

In this example, the job takes around 6 seconds.

If yours takes significantly more than 300 seconds, then you might end up with more than one of these jobs running at a time (prior to this MR). For example, if yours takes 600 seconds, then you are likely to have 2 jobs running at a time. At some point, jobs will error out due to query timeouts, and in this case you should see errors in Sentry and the logs.
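To make the arithmetic above easier to apply, here is a minimal sketch of a helper that summarizes the durations printed by the command above. The function name `summarize_durations` and the `LEASE_TIMEOUT_S` variable are hypothetical and not part of GitLab. The overlap estimate assumes a new job starts as soon as the previous lease lapses, so roughly ceil(duration / lease timeout) jobs can run at once; treat it as a rough indicator, not a measurement.

```shell
# Hypothetical helper (not part of GitLab): reads one duration in seconds
# per line on stdin and prints a one-line summary.
# LEASE_TIMEOUT_S defaults to 300, the pre-MR threshold mentioned above.
summarize_durations() {
  awk -v lease="${LEASE_TIMEOUT_S:-300}" '
    # Only consider lines that look like plain numbers (skips "null" etc.).
    $1 ~ /^[0-9.]+$/ {
      n++
      if ($1 + 0 > max + 0) max = $1 + 0
      if ($1 + 0 > lease + 0) over++
    }
    END {
      if (n == 0) { print "no durations found"; exit }
      # Assumption: about ceil(max / lease) jobs may overlap.
      overlap = int(max / lease)
      if (overlap * lease < max) overlap++
      if (overlap < 1) overlap = 1
      printf "jobs=%d max_s=%.1f over_lease=%d est_overlap=%d\n", n, max, over + 0, overlap
    }'
}
```

To use it, append `| summarize_durations` to the `grep ... | jq '.duration_s'` pipeline shown earlier. An `est_overlap` greater than 1 suggests that, prior to this MR, more than one job could have been running at a time.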

MR acceptance checklist

This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.

Edited by Michael Kozono
