
Geo: Repository verification gets stuck when all projects have been verified at least once

Summary

While investigating repository verification on staging (gitlab-com/migration#426), I noticed that the same projects were being selected for re-verification over and over on the primary node.

Once all projects have been verified for the first time, we start looking for projects that have an outdated checksum. To find the projects to re-verify, we select those with an empty repository or wiki checksum, ignoring repositories/wikis whose previous verification failed, and limit the result to a batch size of 1000:

SELECT
    "projects"."id"
FROM
    "projects"
    INNER JOIN "project_repository_states" ON "project_repository_states"."project_id" = "projects"."id"
WHERE
    ("project_repository_states"."repository_verification_checksum" IS NULL
        AND "project_repository_states"."last_repository_verification_failure" IS NULL)
    OR ("project_repository_states"."wiki_verification_checksum" IS NULL
        AND "project_repository_states"."last_wiki_verification_failure" IS NULL)
ORDER BY
    projects.last_repository_updated_at ASC NULLS LAST
LIMIT 1000;

The problem with the query above is that it does not filter out projects that have the wiki disabled:

[ stg ] production> finder = Geo::RepositoryVerificationFinder.new
[ stg ] production> project = finder.find_outdated_projects(batch_size: 1000).take
=> #<Project id:3063907>
[ stg ] production> project.repository_state.repository_checksum_outdated?
=> false
[ stg ] production> project.repository_state.wiki_checksum_outdated?
=> false
[ stg ] production> project.repository_state
=> #<ProjectRepositoryState id: 4, project_id: 3063907, repository_verification_checksum: "0000000000000000000000000000000000000000", wiki_verification_checksum: nil, last_repository_verification_failure: nil, last_wiki_verification_failure: nil>
[ stg ] production> project.wiki_enabled?
=> false

When at least 1000 projects with the wiki disabled fill the batch, we'll always query the same 1000 projects and never move forward. This happens because we skip the checksum calculation for the wiki repository when the project has the wiki disabled, so wiki_verification_checksum stays NULL and the row keeps matching the query:

https://gitlab.com/gitlab-org/gitlab-ee/blob/master/ee/app/workers/geo/repository_verification/primary/single_worker.rb#L22

def perform(project_id)
  return unless Gitlab::Geo.primary?

  @project = Project.find_by(id: project_id)
  return if project.nil? || project.pending_delete?

  try_obtain_lease do
    calculate_repository_checksum if repository_state.repository_checksum_outdated?
    calculate_wiki_checksum if repository_state.wiki_checksum_outdated? # <= HERE
  end
end

https://gitlab.com/gitlab-org/gitlab-ee/blob/master/ee/app/models/project_repository_state.rb#L26-30

def wiki_checksum_outdated?
  return false unless project.wiki_enabled?

  wiki_verification_checksum.nil?
end
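
The mismatch between the query and the guard clause can be reproduced without a database. The following is a minimal plain-Ruby sketch, not the actual Geo code: `State`, `find_outdated`, and `verify` are hypothetical stand-ins for the state row, the finder, and the worker, the checksum strings are placeholders, and the batch size is shrunk to 2 for readability.

```ruby
# Hypothetical stand-in for a project_repository_states row joined with its project.
State = Struct.new(:project_id, :wiki_enabled, :repository_checksum, :wiki_checksum,
                   keyword_init: true) do
  def repository_checksum_outdated?
    repository_checksum.nil?
  end

  # Mirrors the guard clause quoted above.
  def wiki_checksum_outdated?
    return false unless wiki_enabled

    wiki_checksum.nil?
  end
end

# Mirrors the SQL: selects on NULL checksums only and ignores wiki_enabled.
def find_outdated(states, batch_size:)
  states.select { |s| s.repository_checksum.nil? || s.wiki_checksum.nil? }.first(batch_size)
end

# Mirrors the worker: only calculates what the predicates report as outdated.
def verify(state)
  state.repository_checksum = "repo-sum" if state.repository_checksum_outdated?
  state.wiki_checksum = "wiki-sum" if state.wiki_checksum_outdated?
end

# Two wiki-disabled projects sort first; a wiki-enabled project waits behind them.
states = [
  State.new(project_id: 1, wiki_enabled: false),
  State.new(project_id: 2, wiki_enabled: false),
  State.new(project_id: 3, wiki_enabled: true)
]

# Run three verification rounds and record which projects each round selected.
batches = 3.times.map do
  batch = find_outdated(states, batch_size: 2)
  batch.each { |s| verify(s) }
  batch.map(&:project_id)
end

p batches # => [[1, 2], [1, 2], [1, 2]]
```

Project 3 is never reached: projects 1 and 2 keep a nil wiki checksum after every round, so they keep occupying the whole batch.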

Possible fixes

We have two ways to solve this:

  1. Filter out projects that have the wiki disabled when querying the database for projects whose checksum is outdated.
  2. Remove the guard clause from wiki_checksum_outdated?, so that these projects get the dummy checksum. We should also clear the wiki checksum when the wiki is re-enabled for a project.

I think the first option could increase the time the query takes on the primary node, and it adds a potential for data loss because we don't remove the repository from disk when we disable the wiki for a project. I vote for the latter; this way we can ensure both nodes have the same content.
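
A rough sketch of what the second option might look like, in plain Ruby rather than the real model. `WikiState`, `wiki_content`, and `enable_wiki!` are hypothetical names, the SHA1 digest is only a placeholder for the real checksum calculation, and the all-zero dummy value is assumed from the console output above:

```ruby
require "digest"

# Assumption: the all-zero value seen in the console output serves as the dummy checksum.
DUMMY_CHECKSUM = "0" * 40

# Hypothetical stand-in for ProjectRepositoryState.
# wiki_content is nil when there is no wiki repository to checksum.
WikiState = Struct.new(:wiki_enabled, :wiki_content, :wiki_verification_checksum,
                       keyword_init: true) do
  # Option 2: no wiki_enabled? guard, so the predicate agrees with the SQL filter.
  def wiki_checksum_outdated?
    wiki_verification_checksum.nil?
  end

  # Placeholder calculation: dummy value when there is nothing to checksum.
  def calculate_wiki_checksum
    self.wiki_verification_checksum =
      wiki_content.nil? ? DUMMY_CHECKSUM : Digest::SHA1.hexdigest(wiki_content)
  end

  # Re-enabling the wiki clears the checksum so the project is selected again.
  def enable_wiki!
    self.wiki_enabled = true
    self.wiki_verification_checksum = nil
  end
end

state = WikiState.new(wiki_enabled: false)
state.calculate_wiki_checksum if state.wiki_checksum_outdated?
p state.wiki_checksum_outdated? # false: the row no longer matches IS NULL

state.enable_wiki!
p state.wiki_checksum_outdated? # true: picked up again on the next run
```

With the guard gone, every selected row ends up with a non-NULL wiki checksum after one pass, so the batch query can move on to other projects.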