Geo: Ensure only one MetricsUpdateWorker runs at a time
What does this MR do and why?
Extend the lease timeout of a job that can take longer than 5 minutes, to avoid compounding a performance problem.
The main downside I see is that if a metrics job is somehow lost, it can take up to an hour for a new one to start. StandardErrors are rescued by ExclusiveLeaseGuard and the lease is released in that case, so this should be rare.
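The trade-off described above can be illustrated with a minimal in-memory sketch. This is not GitLab's ExclusiveLeaseGuard implementation: SketchLease, run_with_lease, and the "metrics" key are hypothetical names, and the sketch assumes the lease is released on normal completion as well as on a rescued StandardError. A "lost" job is modeled as a lease that is obtained and never released.

```ruby
# Hypothetical in-memory stand-in for an exclusive lease (illustration only).
class SketchLease
  def initialize
    @expires_at = {}
  end

  # Returns true if the lease was obtained, false if a live lease already exists.
  def try_obtain(key, timeout_s, now: Time.now)
    expiry = @expires_at[key]
    return false if expiry && expiry > now

    @expires_at[key] = now + timeout_s
    true
  end

  def release(key)
    @expires_at.delete(key)
  end
end

# Runs the block only if the lease is free. The lease is released when the
# block finishes or raises; a skipped run never touches the existing lease.
def run_with_lease(lease, key, timeout_s, now: Time.now)
  return :skipped unless lease.try_obtain(key, timeout_s, now: now)

  begin
    yield
    :ran
  rescue StandardError
    :failed
  ensure
    lease.release(key)
  end
end

lease = SketchLease.new
start = Time.at(0)

# A failing job releases the lease, so the next job can start immediately.
run_with_lease(lease, "metrics", 3600, now: start) { raise "query timeout" } # => :failed
run_with_lease(lease, "metrics", 3600, now: start + 1) { nil }               # => :ran

# A lost job: the lease is taken and never released, so later jobs are
# skipped until the one-hour timeout has elapsed.
lease.try_obtain("metrics", 3600, now: start + 2)
run_with_lease(lease, "metrics", 3600, now: start + 302) { nil }  # => :skipped
run_with_lease(lease, "metrics", 3600, now: start + 3602) { nil } # => :ran
```

With a one-hour timeout, only the lost-job case pays the full wait; an ordinary failure frees the lease right away.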
Related to #370158 (comment 1056001661)
Diagnosing the problem that this addresses
If you suspect you have a long-running Geo::MetricsUpdateWorker, you can confirm this by outputting the duration of these completed jobs with:
sudo grep 'Geo::MetricsUpdateWorker.*done:' /var/log/gitlab/sidekiq/current | jq '.duration_s'
Example partial snippet of output:
5.750802
7.153363
6.536177
6.284856
6.092997
6.343974
In this example, the job takes around 6 seconds.
If yours takes significantly more than 300 seconds, then (prior to this MR) you might end up with more than one of these jobs running at a time. For example, if yours takes 600 seconds, you are likely to have 2 jobs running at a time. At some point, jobs will error out due to query timeouts, in which case you should see errors in Sentry and the logs.
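The overlap arithmetic can be sketched as a rough estimate. The helper name estimated_concurrent_jobs is hypothetical, and the 300-second default interval is an assumption inferred from the threshold mentioned above, not a value read from the worker's configuration.

```ruby
# Rough estimate of how many runs overlap when a new run starts every
# interval_s seconds and each run takes duration_s seconds.
# interval_s = 300 is an assumption based on the 300-second threshold above.
def estimated_concurrent_jobs(duration_s, interval_s = 300)
  [(duration_s.to_f / interval_s).ceil, 1].max
end

estimated_concurrent_jobs(6.3) # => 1
estimated_concurrent_jobs(600) # => 2
```

Feeding in the largest duration_s value from the log output gives a quick upper bound on how many jobs could have been running at once before this change.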
MR acceptance checklist
This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.
- I have evaluated the MR acceptance checklist for this MR.