Geo Framework: Persistent failures block/slow-down retries of transient failures

A small proportion of failed registries have extremely high retry_count:

irb(main):085:0> Geo::SnippetRepositoryRegistry.needs_sync_again.where('retry_count > 40').count
=> 866

I think this is because:

  1. There are persistent failures in Geo staging
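  2. We neglected to add an ORDER BY to the relation used in find_registries_needs_sync_again. I assume the order defaults to ID ASC, which would give low IDs first priority, so persistently failing registries with low IDs keep filling each batch ahead of everything else.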

I think we need to order the needs_sync_again scope: https://gitlab.com/gitlab-org/gitlab/-/blob/v13.6.3-ee/ee/app/models/concerns/geo/replicable_registry.rb#L39

I'll open a follow-up issue for this. It's a pre-existing problem for the whole Geo framework whenever there are persistent failures.
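
To illustrate the suspected starvation, here is a minimal simulation in plain Ruby. It is not the Geo code: the Registry struct, the last_attempt_at field, and the run_rounds helper are all hypothetical, and ordering by the oldest attempt is just one possible ordering, not necessarily the one we should pick. It only shows that a fixed ID ASC order lets persistent failures with low IDs monopolize every batch, while an order that rotates through failed rows lets transient failures get their retry.

```ruby
# Simulation only: plain structs stand in for registry rows (hypothetical fields).
Registry = Struct.new(:id, :persistent, :retry_count, :last_attempt_at, :synced,
                      keyword_init: true)

def build_registries
  # IDs 1..3 fail forever; IDs 4..6 would succeed on their next retry.
  (1..6).map do |id|
    Registry.new(id: id, persistent: id <= 3, retry_count: 0,
                 last_attempt_at: 0, synced: false)
  end
end

# Runs `rounds` retry rounds. Each round takes `batch_size` unsynced rows,
# in the order given by the block, and retries them.
def run_rounds(registries, rounds:, batch_size:, &order_by)
  rounds.times do |round|
    now = round + 1
    batch = registries.reject(&:synced).sort_by(&order_by).first(batch_size)
    batch.each do |r|
      r.retry_count += 1
      r.last_attempt_at = now
      r.synced = !r.persistent # persistent failures never succeed
    end
  end
  registries
end

# What I assume happens today: implicit ID ASC order.
by_id = run_rounds(build_registries, rounds: 10, batch_size: 3) { |r| r.id }

# One possible alternative: least recently attempted first, ID as tie-breaker.
by_oldest_attempt = run_rounds(build_registries, rounds: 10, batch_size: 3) do |r|
  [r.last_attempt_at, r.id]
end

starved   = by_id.count { |r| !r.persistent && !r.synced }
recovered = by_oldest_attempt.count { |r| !r.persistent && r.synced }

puts "ID order: #{starved} transient failures never retried"
puts "Oldest-attempt order: #{recovered} transient failures recovered"
```

With ID order, the three persistent failures are retried in every round and their retry_count keeps climbing, which would match the high retry_count values seen above. With the rotating order, the transient failures are picked up in the second round.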

To do

  • Add order() to needs_sync_again, or to the place where it's used to get a batch (find_registries_needs_sync_again)
  • Generate query plans for the impacted queries, one for each impacted replicator (package files, MR diffs, terraform state versions, and snippet repositories)
  • Add or modify indexes for bad query plans
Edited by Michael Kozono