Geo: Replication and checksumming stop for a whole repository storage
Problem
From #389996 (closed), in `geo.log`:
Checksumming of project repositories is exiting early without doing any checksumming:
```
2023-01-31T21:41:03.400Z: {:message=>"Excluding unhealthy shards", :failed_checks=>[{:status=>"failed", :message=>"unavailable repositories: repositories unavailable", :labels=>{:shard=>"default"}}], :class=>"Geo::RepositoryVerification::Primary::BatchWorker"}
```
Confirmed in the Rails console:

```ruby
Gitlab::HealthChecks::GitalyCheck.readiness
=> [#<struct Gitlab::HealthChecks::Result name="gitaly_check", success=false, message="unavailable repositories: repositories unavailable", labels={:shard=>"default"}>]
```
Then in a comment:
Does it make sense that we stop checksumming repos in the background when a portion of repos are unavailable? Or would it make more sense to push the problem down to a more granular level: continue as normal and let verification failure logic mark the repo for retry (with progressive backoff)?
I think the checksumming should continue. This is an internal issue in Gitaly Cluster with a given repository and shouldn't affect actions on other repositories. Marking a failure and retrying sounds like the right action for Geo.
It looks like this filtering of "unhealthy shards" is used by these workers (a rough sketch of the shared pattern follows the list):
- `Geo::RepositoryVerification::Primary::BatchWorker`
- `Geo::RepositoryVerification::Secondary::SchedulerWorker`
- `Geo::RepositorySyncWorker`
- `RepositoryCheck::DispatchWorker`
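For context, here is a rough reconstruction of the exclusion pattern these workers appear to share, pieced together from the log entry above; the method name, logging call, and structure are illustrative and not copied from the GitLab source:

```ruby
# Illustrative reconstruction of the "Excluding unhealthy shards" behavior;
# not the actual GitLab worker code.
def healthy_shard_names
  checks = Gitlab::HealthChecks::GitalyCheck.readiness
  failed = checks.reject(&:success)

  if failed.any?
    # Produces the geo.log entry shown above (the logging helper is an assumption).
    log_info(
      message: "Excluding unhealthy shards",
      failed_checks: failed.map { |c| { status: "failed", message: c.message, labels: c.labels } }
    )
  end

  # Only shards whose Gitaly readiness check passed are scheduled, so a single
  # failing check silently stops work for the whole repository storage.
  checks.select(&:success).map { |check| check.labels[:shard] }
end
```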
Since `Geo::RepositorySyncWorker` is in that list, it looks like this can also block backfilling of project repositories, which is a data loss risk :(
Proposal
Don't run `Gitlab::HealthChecks::GitalyCheck.readiness`; instead, return all shards.
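A minimal sketch of what that could look like, assuming the workers build their shard list from a method like the one sketched above; `eligible_shard_names` is an illustrative name, and `Gitlab.config.repositories.storages` is the configured-storages hash:

```ruby
# Illustrative sketch of the proposal, not a definitive implementation.
def eligible_shard_names
  # Before (assumed): only shards whose Gitaly readiness check passed.
  # Gitlab::HealthChecks::GitalyCheck.readiness.select(&:success).map { |r| r.labels[:shard] }

  # After (proposal): every configured repository storage. Per-repository
  # failures are then handled by the normal verification/sync retry logic
  # (with progressive backoff) instead of skipping the whole shard.
  Gitlab.config.repositories.storages.keys
end
```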
Question
- Is there another shard check that would be useful for excluding truly dead shards?
I think that in the case of both verification workers and the sync worker, we don't need to exclude even truly dead shards. This behavior short-circuits our normal failure handling and is confusing to diagnose.
Diagnosing
If checksumming/verification/replication is not occurring for every repository on one particular repository storage, then you should suspect this issue.
You can rule this issue out by running `gitlab-rake gitlab:gitaly:check` on all Geo sites. If it returns `OK`, then you are not affected by this issue.
If `gitlab-rake gitlab:gitaly:check` returns an error, then you can confirm this particular issue by finding the `Excluding unhealthy shards` log message, e.g.:

```
grep "Excluding unhealthy shards" /var/log/gitlab/gitlab-rails/geo.log | less
```
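You can also check directly from a Rails console on each site, reusing the readiness call shown above, to see exactly which shard checks are failing and why:

```ruby
# In a Rails console (sudo gitlab-rails console), list only the failing shard checks.
# An empty array means all shards pass the check.
Gitlab::HealthChecks::GitalyCheck.readiness.reject(&:success)
# => [#<struct Gitlab::HealthChecks::Result name="gitaly_check", success=false,
#      message="unavailable repositories: repositories unavailable",
#      labels={:shard=>"default"}>]
```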
Note that the change to the Gitaly `readiness` check that makes this error more likely was introduced in %15.4, so this issue is less likely to occur prior to that version. But if you are on a prior version and you do see `Excluding unhealthy shards`, then your next step is also to troubleshoot Gitaly; the root cause is likely to be different from what is described in this issue.
Workaround
You need to make `gitlab-rake gitlab:gitaly:check` return `OK`, like:
```
# gitlab-rake gitlab:gitaly:check
Checking Gitaly ...
Gitaly: ... default ... OK
Checking Gitaly ... Finished
```
There can be a variety of root causes for this check to not return `OK`. See Gitaly Troubleshooting.
For example, in #389996 (closed), the 2 unavailable repositories could be removed since they were the result of botched project deletions, and the 300+ unavailable pool repositories could be removed since they were produced by bugs fixed in 14.10.
When the check returns `OK`, replication and verification should resume for that repository storage.
Notes
Side note for the dev: the secondary worker applies an additional filter.