Geo: Repository verification gets stuck when all projects have been verified at least once
Summary
While investigating repository verification on staging (gitlab-com/migration#426), I noticed that the same projects were being selected for re-verification over and over on the primary node.
Once we have verified all projects for the first time, we start looking for projects that have an outdated checksum. To get the outdated projects to re-verify, we find projects that have an empty repository or wiki checksum, ignoring repositories/wikis whose previous verification failed, and limit the result to a batch size of 1000:
SELECT
    "projects"."id"
FROM
    "projects"
    INNER JOIN "project_repository_states" ON "project_repository_states"."project_id" = "projects"."id"
WHERE ("project_repository_states"."repository_verification_checksum" IS NULL
        AND "project_repository_states"."last_repository_verification_failure" IS NULL
    OR "project_repository_states"."wiki_verification_checksum" IS NULL
        AND "project_repository_states"."last_wiki_verification_failure" IS NULL)
ORDER BY
    projects.last_repository_updated_at ASC NULLS LAST
LIMIT 1000;
The problem with the query above is that it does not filter out projects that have the wiki disabled:
[ stg ] production> finder = Geo::RepositoryVerificationFinder.new
[ stg ] production> project = finder.find_outdated_projects(batch_size: 1000).take
=> #<Project id:3063907>
[ stg ] production> project.repository_state.repository_checksum_outdated?
=> false
[ stg ] production> project.repository_state.wiki_checksum_outdated?
=> false
[ stg ] production> project.repository_state
=> #<ProjectRepositoryState id: 4, project_id: 3063907, repository_verification_checksum: "0000000000000000000000000000000000000000", wiki_verification_checksum: nil, last_repository_verification_failure: nil, last_wiki_verification_failure: nil>
[ stg ] production> project.wiki_enabled?
=> false
Once 1000 projects with the wiki disabled sort to the front of this query, we always get the same 1000 projects back and never move forward. This happens because the wiki checksum is never calculated, and so stays NULL, when the project has the wiki disabled:
def perform(project_id)
  return unless Gitlab::Geo.primary?

  @project = Project.find_by(id: project_id)
  return if project.nil? || project.pending_delete?

  try_obtain_lease do
    calculate_repository_checksum if repository_state.repository_checksum_outdated?
    calculate_wiki_checksum if repository_state.wiki_checksum_outdated? # <= HERE
  end
end
https://gitlab.com/gitlab-org/gitlab-ee/blob/master/ee/app/models/project_repository_state.rb#L26-30
def wiki_checksum_outdated?
  return false unless project.wiki_enabled?

  wiki_verification_checksum.nil?
end
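To make the loop easier to see, here is a minimal in-memory sketch of the behaviour described above. It is not the real GitLab code: State, find_outdated and verify are hypothetical stand-ins for the model, the finder query and the worker, the failure columns are left out, and the batch size is reduced to 3.

```ruby
# Hypothetical stand-in for ProjectRepositoryState (assumption, not GitLab code).
State = Struct.new(:project_id, :wiki_enabled, :repository_checksum, :wiki_checksum,
                   keyword_init: true) do
  def repository_checksum_outdated?
    repository_checksum.nil?
  end

  # Same guard clause as above: a disabled wiki is never reported as outdated.
  def wiki_checksum_outdated?
    return false unless wiki_enabled

    wiki_checksum.nil?
  end
end

# Stand-in for the SQL query: it looks at NULL checksums only and
# knows nothing about whether the wiki is enabled.
def find_outdated(states, batch_size:)
  states.select { |s| s.repository_checksum.nil? || s.wiki_checksum.nil? }
        .first(batch_size)
end

# Stand-in for the worker: it only calculates what the model reports as outdated.
def verify(state)
  state.repository_checksum = "0" * 40 if state.repository_checksum_outdated?
  state.wiki_checksum = "0" * 40 if state.wiki_checksum_outdated?
end

# Projects 1-3 have the wiki disabled and sort first; projects 4-5 have it enabled.
states = (1..5).map do |id|
  State.new(project_id: id, wiki_enabled: id > 3,
            repository_checksum: "0" * 40, wiki_checksum: nil)
end

first_batch = find_outdated(states, batch_size: 3).map(&:project_id)
find_outdated(states, batch_size: 3).each { |s| verify(s) }
second_batch = find_outdated(states, batch_size: 3).map(&:project_id)

puts first_batch.inspect
puts second_batch.inspect
```

In this model both batches contain the same three wiki-disabled projects, and projects 4 and 5 are never reached even though their wiki checksum is genuinely outdated.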
Possible fixes
We have two ways to solve this:
- Filter out projects that have the wiki disabled when querying the database for projects whose checksum is outdated.
- Remove the guard clause from wiki_checksum_outdated? so that these projects are given the dummy checksum. We should also clear the wiki checksum when the project has the wiki re-enabled.
I think the first option can increase the time the query takes to run on the primary node, and it adds a potential for data loss because we don't remove the repository from disk when we disable the wiki for a project. I vote for the latter; this way we can ensure both nodes have the same content.
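As a rough illustration of the second option, here is a small in-memory sketch. It is not a patch: State, find_outdated, verify and enable_wiki! are hypothetical names, only the wiki checksum is modelled, and the all-zero DUMMY_CHECKSUM value is an assumption based on the checksum shown in the console output above.

```ruby
# Assumed placeholder value for a wiki with nothing to verify.
DUMMY_CHECKSUM = "0" * 40

# Hypothetical stand-in for ProjectRepositoryState (assumption, not GitLab code).
State = Struct.new(:project_id, :wiki_enabled, :wiki_checksum, keyword_init: true) do
  # Guard clause removed: "outdated" now depends only on the stored checksum.
  def wiki_checksum_outdated?
    wiki_checksum.nil?
  end

  # Hypothetical hook: re-enabling the wiki clears the checksum so that
  # the wiki is picked up for verification again.
  def enable_wiki!
    self.wiki_enabled = true
    self.wiki_checksum = nil
  end
end

# Stand-in for the finder query, unchanged: it selects on NULL checksums only.
def find_outdated(states, batch_size:)
  states.select { |s| s.wiki_checksum.nil? }.first(batch_size)
end

# Stand-in for the worker: every outdated wiki now ends up with a checksum.
def verify(state)
  state.wiki_checksum = DUMMY_CHECKSUM if state.wiki_checksum_outdated?
end

# Projects 1-3 have the wiki disabled; projects 4-5 have it enabled.
states = (1..5).map { |id| State.new(project_id: id, wiki_enabled: id > 3, wiki_checksum: nil) }

batches = []
loop do
  batch = find_outdated(states, batch_size: 3)
  break if batch.empty?

  batches << batch.map(&:project_id)
  batch.each { |s| verify(s) }
end

# Re-enabling the wiki on project 1 puts it back in the queue.
states.first.enable_wiki!
requeued = find_outdated(states, batch_size: 3).map(&:project_id)

puts batches.inspect
puts requeued.inspect
```

With the guard clause gone, each batch drains and the next one moves on to new projects, and the finder query itself does not need to change.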