Geo: Replication and checksumming stop for a whole repository storage

Problem

From #389996 (closed) in geo.log:

Checksumming of project repositories exits early without checksumming anything:

2023-01-31T21:41:03.400Z: {:message=>"Excluding unhealthy shards", :failed_checks=>[{:status=>"failed", :message=>"unavailable repositories: repositories unavailable", :labels=>{:shard=>"default"}}], :class=>"Geo::RepositoryVerification::Primary::BatchWorker"}

Confirmed in Rails console:

$ Gitlab::HealthChecks::GitalyCheck.readiness
=> [#<struct Gitlab::HealthChecks::Result name="gitaly_check", success=false, message="unavailable repositories: repositories unavailable", labels={:shard=>"default"}>]
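
To see which shards the workers will skip, you can filter the same readiness result in the Rails console. This is a sketch based on the Result struct above, which exposes success and labels:

Gitlab::HealthChecks::GitalyCheck.readiness
  .reject(&:success)                       # keep only the failed checks
  .map { |result| result.labels[:shard] }  # extract the shard name
=> ["default"]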

Then in a comment:

Does it make sense that we stop checksumming repos in the background when a portion of repos are unavailable? Or would it make more sense to push the problem down to a more granular level: continue as normal and let verification failure logic mark the repo for retry (with progressive backoff)?

I think the checksumming should continue. This is an internal issue in Gitaly Cluster with a given repository and shouldn't affect actions on other repositories. Marking a failure and retrying sounds like the right action for Geo.

It looks like this filtering of "unhealthy shards" is used by these workers:

  • Geo::RepositoryVerification::Primary::BatchWorker
  • Geo::RepositoryVerification::Secondary::SchedulerWorker
  • Geo::RepositorySyncWorker
  • RepositoryCheck::DispatchWorker

Since Geo::RepositorySyncWorker is on that list, this can also block backfilling of project repos, which is a data loss risk :(
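
For reference, the shard filtering shared by these workers looks roughly like the sketch below. It is illustrative (the method name healthy_shard_names and its exact location are assumptions), but it matches the readiness API shown above:

# Sketch of the shard filtering shared by the workers listed above.
# Names are illustrative, not the exact GitLab source.
def healthy_shard_names
  ::Gitlab::HealthChecks::GitalyCheck.readiness
    .select(&:success)                      # drop any shard whose check failed
    .map { |result| result.labels[:shard] }
    .compact
    .uniq
end

Each scheduling pass then iterates only over the healthy shards, so a shard with even a few unavailable repositories is skipped entirely.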

Proposal

Don't run Gitlab::HealthChecks::GitalyCheck.readiness; instead, return all shards.
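
A minimal sketch of that proposal, assuming the shard names can be taken from the configured repository storages (the method name eligible_shard_names is illustrative):

# Instead of filtering by readiness, return every configured shard.
def eligible_shard_names
  Gitlab.config.repositories.storages.keys
end

Per-repository failures would then be surfaced by the normal verification failure and retry logic instead of silently skipping the whole shard.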

Question

  • Is there another shard check which would be useful to use to exclude truly dead shards?

I think that for both Verification workers and the Sync worker, we don't need to exclude even truly dead shards. This behavior short-circuits our normal failure handling and is confusing to diagnose.

Diagnosing

If checksumming, verification, or replication has stopped for all repos on one repository storage, then you should suspect this issue.

You can rule out this issue by running gitlab-rake gitlab:gitaly:check on all Geo sites. If it returns OK, then you are not affected by this issue.

If gitlab-rake gitlab:gitaly:check returns an error, then you can confirm this particular issue by finding the Excluding unhealthy shards error message, e.g.: grep "Excluding unhealthy shards" /var/log/gitlab/gitlab-rails/geo.log | less.

Note that the change to the Gitaly readiness check that makes this error more likely was introduced in %15.4, so this issue is less likely to occur prior to that version. But if you are on a prior version and you do see Excluding unhealthy shards, then your next step is still to troubleshoot Gitaly; the root cause is likely to be different from what is described in this issue.

Workaround

You need to make gitlab-rake gitlab:gitaly:check return OK, like:

# gitlab-rake gitlab:gitaly:check
Checking Gitaly ...

Gitaly: ... default ... OK

Checking Gitaly ... Finished

There can be a variety of root causes for this check to return something other than OK. See Gitaly Troubleshooting.

For example, in #389996 (closed), the 2 unavailable repositories could be removed since they were the result of botched project deletions, and the 300+ unavailable pool repositories could be removed since they were produced by bugs fixed in 14.10.
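
For a Gitaly Cluster setup like that one, a possible cleanup path is to list the unavailable repositories with praefect dataloss and, once each is confirmed to be orphaned, remove it with praefect remove-repository. The paths below are the Omnibus defaults; treat this as a sketch and verify against the Gitaly troubleshooting docs before deleting anything:

# List unavailable repositories per virtual storage
sudo /opt/gitlab/embedded/bin/praefect -config /var/opt/gitlab/praefect/config.toml dataloss

# Remove a repository you have confirmed is orphaned (placeholders in <>).
# On newer versions this is a dry run unless you also pass -apply.
sudo /opt/gitlab/embedded/bin/praefect -config /var/opt/gitlab/praefect/config.toml \
  remove-repository -virtual-storage <virtual-storage> -repository <relative-path> -apply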

When the check returns OK, then replication and verification should resume for that repository storage.

Notes

Side note for the developer: the secondary worker applies an additional filter.
