Geo secondary repository verification gets stuck after 1000 failed repositories
While testing Geo secondary repository verification, I noticed that the same failed repositories were being re-verified over and over.
To get the repositories to verify, we grab a list of registries where either repository_verification_checksum IS NULL or wiki_verification_checksum IS NULL, and limit it to a batch size of 1000.

    # Find all registries that repository or wiki need verification
    # @return [ActiveRecord::Relation<Geo::ProjectRegistry>] list of registries that need verification
    def fdw_find_registries_to_verify(batch_size:)
      Geo::ProjectRegistry
        .joins(fdw_inner_join_repository_state)
        .where(
          local_registry_table[:repository_verification_checksum].eq(nil).or(
            local_registry_table[:wiki_verification_checksum].eq(nil)
          )
        )
        .where(
          fdw_repository_state_table[:repository_verification_checksum].not_eq(nil).or(
            fdw_repository_state_table[:wiki_verification_checksum].not_eq(nil)
          )
        ).limit(batch_size)
    end

Verification failures leave the checksum as NULL, and the query has no ordering. This means that once we have 1000 failed repositories, we'll always query the same 1000 failed repositories, never moving forward.

    def load_pending_resources
      finder.find_registries_to_verify(batch_size: db_retrieve_batch_size)
            .pluck(:id)
    end

    def schedule_job(registry_id)
      job_id = Geo::RepositoryVerification::Secondary::SingleWorker.perform_async(registry_id)

      { id: registry_id, job_id: job_id } if job_id
    end

    def finder
      @finder ||= Geo::ProjectRegistryFinder.new
    end

This was one of the original reasons for the last_verification_at dates: so that we could ensure we didn't keep pulling the same records.
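
To illustrate the idea, here is a minimal in-memory sketch of how ordering by a last-attempt timestamp keeps the batch moving. It is not the actual finder: the `Registry` struct, `naive_batch`, `fair_batch`, and `run_failed_round` are hypothetical names, and a single `checksum` / `last_verification_at` pair stands in for the separate repository and wiki columns.

```ruby
# Hypothetical in-memory stand-in for Geo::ProjectRegistry rows.
Registry = Struct.new(:id, :checksum, :last_verification_at)

# Models the current behaviour: any rows with a NULL checksum, capped at
# batch_size. "No particular order" is modelled here as insertion order.
def naive_batch(registries, batch_size)
  registries.select { |r| r.checksum.nil? }.first(batch_size)
end

# Sketch of a possible fix: rows that were never attempted come first,
# then the least recently attempted ones.
def fair_batch(registries, batch_size)
  registries
    .select { |r| r.checksum.nil? }
    .sort_by { |r| [r.last_verification_at ? 1 : 0, r.last_verification_at.to_i, r.id] }
    .first(batch_size)
end

# Simulates a round where every verification fails: the checksum stays nil,
# but the time of the attempt is recorded.
def run_failed_round(batch, now)
  batch.each { |r| r.last_verification_at = now }
end

registries = (1..5).map { |id| Registry.new(id, nil, nil) }

first_round = fair_batch(registries, 2)
run_failed_round(first_round, Time.at(100))
second_round = fair_batch(registries, 2)

puts "naive batch after failures: #{naive_batch(registries, 2).map(&:id).inspect}"
puts "fair batch after failures:  #{second_round.map(&:id).inspect}"
```

With a batch size of 2, the naive selection keeps returning registries 1 and 2 after they fail, while the ordered selection moves on to 3 and 4. In SQL terms this would presumably correspond to adding an ORDER BY on the timestamp with NULLs first before the LIMIT, though the exact column to use is an assumption here.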