Geo::RepositoryVerification::Secondary::ShardWorker lease key is not namespaced by shard name
Summary
When GitLab Geo verification jobs run, we schedule one job per repository shard. The job that runs first acquires an exclusive lease (a global lock), and all the other jobs, each for a different shard, fail to acquire it.
I initially suspected this was caused by a misbehaving Sidekiq process that had not been terminated correctly and was still holding the lease (until its TTL expired), but after inspecting the source code I discovered that the lease key is actually the same for all shards.
We haven't seen this on GitLab.com during the GCP migration as frequently as with this particular customer, probably because of the volume of jobs running on our infrastructure: there are always more queued operations than jobs to execute them, which helps ensure that every shard eventually acquires the lock and catches up.
Customer affected: https://gitlab.zendesk.com/agent/tickets/103488
Steps to reproduce
To reproduce quickly in a Rails console, run Geo::RepositoryVerification::Secondary::ShardWorker.new.lease_key. It returns "geo/repository_verification/secondary/shard_worker", with no shard name in the key.
To simulate this in a real environment, you need at least two repository storages defined in the configuration file. Then wait for Geo::RepositoryVerification::Secondary::ShardWorker to run.
What is the current bug behavior?
You will see something like the output below in the /var/log/gitlab/gitlab-rails/geo.log file:
{"severity":"INFO","time":"2018-09-24T18:45:19.049Z","class":"Geo::RepositoryVerification::Secondary::ShardWorker","message":"Started scheduler","job_id":"2272809ac343621d5aa420eb","shard":"repos2"}
{"severity":"ERROR","time":"2018-09-24T18:45:19.071Z","class":"Geo::RepositoryVerification::Secondary::ShardWorker","message":"Cannot obtain an exclusive lease. There must be another instance already in execution.","job_id":"389bbf14d0680c580e4147ed","shard":"default"}
{"severity":"ERROR","time":"2018-09-24T18:45:19.072Z","class":"Geo::RepositoryVerification::Secondary::ShardWorker","message":"Cannot obtain an exclusive lease. There must be another instance already in execution.","job_id":"5853eeb3cf4ed6af1b8b93ab","shard":"repos3"}
{"severity":"ERROR","time":"2018-09-24T18:45:19.086Z","class":"Geo::RepositoryVerification::Secondary::ShardWorker","message":"Cannot obtain an exclusive lease. There must be another instance already in execution.","job_id":"335433d231562d1476cd1927","shard":"repos4"}
In the example log above, you can see that the first job to execute acquired the lock for the repos2 repository shard. All the others failed.
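The collision can be illustrated outside GitLab with a toy in-memory lease store. This is only a sketch: ToyLeaseStore and try_obtain are hypothetical names made up for illustration and are not GitLab's lease implementation. Only the key string is taken from the reproduction step above.

```ruby
require "set"

# Hypothetical in-memory stand-in for an exclusive lease store.
# It is NOT GitLab's implementation; it only models "one holder per key".
class ToyLeaseStore
  def initialize
    @held = Set.new
  end

  # Returns true if the key was free and is now held, false otherwise.
  def try_obtain(key)
    return false if @held.include?(key)

    @held.add(key)
    true
  end
end

# The key reported by ShardWorker.new.lease_key: it contains no shard name.
SHARED_KEY = "geo/repository_verification/secondary/shard_worker"

store = ToyLeaseStore.new

# One job per shard, in the order they appear in the log above.
results = %w[repos2 default repos3 repos4].map do |shard|
  [shard, store.try_obtain(SHARED_KEY)]
end.to_h

results.each do |shard, obtained|
  message = obtained ? "Started scheduler" : "Cannot obtain an exclusive lease"
  puts "#{shard}: #{message}"
end
```

Because every job asks for the same key, only the first caller succeeds, which matches the pattern in the log.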
What is the expected correct behavior?
All four jobs should have acquired a lock and logged "Started scheduler".
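One possible direction, shown here only as a sketch, is to include the shard name in the lease key so that each shard takes its own lease. The helper name namespaced_lease_key and the ":shard:" separator are assumptions made for illustration, not the actual fix. The Set below is a toy stand-in for the lease store.

```ruby
require "set"

BASE_KEY = "geo/repository_verification/secondary/shard_worker"

# Hypothetical helper: appends the shard name so each shard gets a distinct key.
# The ":shard:" separator is an arbitrary choice for this sketch.
def namespaced_lease_key(shard_name)
  "#{BASE_KEY}:shard:#{shard_name}"
end

# Toy stand-in for the lease store: a key in the set means the lease is held.
held = Set.new

results = %w[repos2 default repos3 repos4].map do |shard|
  key = namespaced_lease_key(shard)
  # Set#add? returns nil when the key is already present (lease already taken).
  [shard, !held.add?(key).nil?]
end.to_h

results.each do |shard, obtained|
  puts "#{shard}: #{obtained ? 'Started scheduler' : 'Cannot obtain an exclusive lease'}"
end
```

With distinct keys, jobs for different shards no longer compete with each other, while two jobs for the same shard would still be mutually exclusive.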