Geo replication and verification not keeping up
For the last few weeks, repository and wiki verifications on our Geo secondary for git.drupalcode.org have been gradually falling behind.
- I think this may have started with the 15.5 to 15.6.1 upgrade. Because the slowdown was very gradual and happened over the holidays, I monitored it but did not take any measures to correct it.
- Around this upgrade, we stabilized our upgrade process, stopping Sidekiq processes early before the upgrade so that long-running jobs can complete in time. We also aligned our upgrade process with the current instructions: https://docs.gitlab.com/ee/administration/geo/replication/upgrading_the_geo_sites.html
- After upgrading to 15.7.2, which we are currently running, verifications have fallen further behind, losing maybe a percent a day.
- /admin/geo/sites reports all instances are healthy.
- The primary site's verifications have been 99.9% or higher.
- sudo gitlab-rake gitlab:geo:check reports everything is good on both primary and secondary sites.
- On the secondary geo site:
  - In the tuning settings, we doubled the verification concurrency limit 3 times, currently to 800
  - We added a second Sidekiq process, both having 20 threads. We have not seen any Sidekiq backlog.
  - It does not appear resource constrained, CPU usage is 10% or less, load average 2-3, ~50% free memory.
- We have 86,500 repositories.
- I've reviewed https://docs.gitlab.com/ee/administration/geo/replication/troubleshooting.html and haven't spotted anything helpful for our situation.
None of these measures seems to have had much effect on the secondary's verification queue; it is still gradually falling further behind.
I'd like recommendations for finding potential underlying issues, and ways to get verification caught up.
Replication
There are also around 200 repositories which seem to be stuck in pending synchronization. When we manually synced one in Rails console, it succeeded.
Possible Workarounds
Project and Wiki verification loop
In 15.6 - 15.8, when:
- there are a number of Wikis persistently "Queued" for verification (and probably some Project Repositories persistently "Queued" for verification)
- the Sidekiq logs show jobs with "class":"Geo::RepositoryVerification::Secondary::SingleWorker" running repeatedly with the same "args"
- and especially when most of those projects don't even have Wikis enabled
Then you can temporarily quiet the system by running this command in Rails console to allow the verification jobs to mark them succeeded:
Geo::ProjectRegistry.wikis_checksummed_pending_verification.find_in_batches(batch_size: 20) { |pr_batch| p_ids = pr_batch.pluck(:project_id); pwr_batch = Geo::ProjectWikiRepositoryRegistry.where(project: p_ids).verification_succeeded; puts "projects: #{p_ids.size}, project_wiki_repository_registry: #{pwr_batch.count}"; puts pwr_batch.update_all(verification_checksum: nil); puts Geo::ProjectWikiRepositoryRegistry.mark_as_verification_pending(pwr_batch) };1
It is temporary because Git pushes and new projects will cause wikis to get into this verification state again. The bug is expected to be fixed in releases which include !109882 (merged).
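To check the second condition above, one option is to count how often the same "args" value shows up for that worker class in the Sidekiq logs. The following is a minimal Ruby sketch, not an official GitLab tool. It assumes the logs are JSON with one object per line, that each entry carries "class", "args", and "job_status" fields, and that one "start" entry is written per job run. The helper names and the min_count threshold are hypothetical and chosen only for illustration.

```ruby
require "json"

# Worker class named in the condition above.
VERIFICATION_WORKER = "Geo::RepositoryVerification::Secondary::SingleWorker"

# Parse one log line; return nil for anything that is not a JSON object.
def parse_log_line(line)
  parsed = JSON.parse(line)
  parsed.is_a?(Hash) ? parsed : nil
rescue JSON::ParserError
  nil
end

# Count job starts per distinct "args" value for the given worker and
# return [args, count] pairs seen at least min_count times, most frequent first.
# Assumption: a "job_status" of "start" marks the beginning of one job run.
def repeated_verification_args(lines, worker: VERIFICATION_WORKER, min_count: 2)
  counts = Hash.new(0)
  lines.each do |line|
    entry = parse_log_line(line)
    next if entry.nil?
    next unless entry["class"] == worker && entry["job_status"] == "start"

    counts[entry["args"]] += 1
  end
  counts.select { |_args, count| count >= min_count }
        .sort_by { |_args, count| -count }
end
```

It could be fed the lines of a Sidekiq log, for example with File.foreach on the log file of the secondary site (the exact path depends on the installation). A long list of args that each repeat many times would be consistent with the loop described above, but it is not proof on its own.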
Verification concurrency limit for a single data type
Over time, as Geo added support for many new data types, Verification concurrency limit became spread over many data types. The way the limit is implemented means that if the bulk of your system's verification work is dominated by a single data type, for example Project Git Repositories, then you may now need to increase Verification concurrency limit to achieve the same rate of processing for Project Git Repositories. See #387980 (comment 1246735496).
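As a rough way to reason about the paragraph above, the sketch below models the limit as being divided evenly across data types, with a floor of one job per type. This even split is an assumption made for illustration, not a confirmed description of the implementation (see the linked comment for details). The function names and the type count of 16 are hypothetical.

```ruby
# Assumed model: the configured limit is split evenly across data types,
# and every type gets at least one concurrent verification job.
def per_type_capacity(limit, type_count)
  raise ArgumentError, "type_count must be positive" unless type_count.positive?

  [1, limit / type_count].max
end

# Under the same assumed model, the limit needed so that a single data type
# gets target_per_type concurrent jobs.
def limit_for_target(target_per_type, type_count)
  raise ArgumentError, "type_count must be positive" unless type_count.positive?

  target_per_type * type_count
end

# Hypothetical numbers: a limit of 800 shared by 16 data types.
puts per_type_capacity(800, 16)   # share available to one type under this model
puts limit_for_target(100, 16)    # limit needed for 100 jobs of one type
```

Under this model, a limit of 800 shared by 16 types would leave 50 concurrent jobs for Project Git Repositories, which is why raising the limit several times can still have less effect than expected.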