Geo Framework: Persistent failures block/slow-down retries of transient failures

A small proportion of failed registries have an extremely high retry_count:
irb(main):085:0> Geo::SnippetRepositoryRegistry.needs_sync_again.where('retry_count > 40').count
=> 866
I think this is because:

There are persistent failures in Geo staging

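We neglected to add an ORDER BY to the relation used in find_registries_needs_sync_again. PostgreSQL does not guarantee any row order without one, but I assume that in practice the rows come back in roughly ID ASC order, which would give low IDs first priority. Persistently failing registries with low IDs would then keep filling each batch and crowd out transient failures with higher IDs.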

I think we need to order the needs_sync_again scope: https://gitlab.com/gitlab-org/gitlab/-/blob/v13.6.3-ee/ee/app/models/concerns/geo/replicable_registry.rb#L39

I'll open a follow-up issue for this. It's an existing problem for the whole Geo framework whenever there are persistent failures.

To do

Add order() to needs_sync_again, or to the place where it's used to get a batch (find_registries_needs_sync_again)
Generate query plans for the impacted queries, one for each impacted replicator (package files, MR diffs, terraform state versions, and snippet repositories)
Add or modify indexes for bad query plans
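
To illustrate why the ordering matters, here is a minimal, self-contained sketch in plain Ruby. It is not the Geo code: `Registry`, `next_batch`, and `simulate` are hypothetical names, the one-round retry delay is a simplification, and ordering by `retry_at` is only one assumed candidate for the real `order()` call. It compares picking batches by ID against picking the registries that have been due the longest.

```ruby
# Hypothetical in-memory stand-in for a Geo registry row (not the real model).
Registry = Struct.new(:id, :retry_count, :retry_at, keyword_init: true)

# Pick one batch of registries that are due for retry, sorted by `order_by`.
# Sorting by :id mimics the assumed implicit ID ASC order of the current query.
def next_batch(registries, now:, batch_size:, order_by:)
  registries
    .select { |r| r.retry_at <= now }
    .sort_by { |r| [r.public_send(order_by), r.id] }
    .first(batch_size)
end

# Run a few sync rounds. IDs 1..3 fail on every attempt (persistent failures);
# IDs 4..6 succeed as soon as they are retried (transient failures).
def simulate(order_by:, rounds: 5, batch_size: 3)
  persistent_ids = [1, 2, 3]
  registries = (1..6).map { |id| Registry.new(id: id, retry_count: 1, retry_at: 0) }
  synced_ids = []

  rounds.times do |now|
    next_batch(registries, now: now, batch_size: batch_size, order_by: order_by).each do |r|
      if persistent_ids.include?(r.id)
        # Failed again: bump the counter and make it due in the next round.
        r.retry_count += 1
        r.retry_at = now + 1
      else
        # Synced successfully: it no longer needs to be retried.
        synced_ids << r.id
        registries.delete(r)
      end
    end
  end

  { synced_ids: synced_ids, max_retry_count: registries.map(&:retry_count).max }
end

by_id = simulate(order_by: :id)
by_retry_at = simulate(order_by: :retry_at)

puts by_id[:synced_ids].inspect        # => []
puts by_retry_at[:synced_ids].inspect  # => [4, 5, 6]
```

In this toy setup, ordering by ID retries the three persistent failures in every round and never reaches the transient ones, while ordering by `retry_at` lets the transient failures through in the second round. Whichever column ends up in the real `order()` call needs a supporting index, which is why the query plans should be checked.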

Edited Feb 09, 2021 by Michael Kozono