Geo Framework: Persistent failures block/slow-down retries of transient failures
A small proportion of failed registries have extremely high retry_count:
irb(main):085:0> Geo::SnippetRepositoryRegistry.needs_sync_again.where('retry_count > 40').count
=> 866
I think this is because:
- There are persistent failures in Geo staging
- We neglected to ORDER BY the relation used in find_registries_needs_sync_again. I assume the order defaults to ID ASC, which would give low IDs first priority.
I think we need to order the needs_sync_again scope: https://gitlab.com/gitlab-org/gitlab/-/blob/v13.6.3-ee/ee/app/models/concerns/geo/replicable_registry.rb#L39
I'll open a follow-up issue for this. It's an existing issue for the whole Geo framework when there are persistent failures.
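The starvation effect described above can be illustrated with a small, self-contained simulation. This is only a sketch under stated assumptions: Registry, next_batch, and simulate are hypothetical in-memory stand-ins, not the real Geo models or finders, and ordering by last_synced_at is just one possible choice of sort key, not necessarily the one the fix should use.

```ruby
# Hypothetical in-memory stand-in for a Geo registry row; not the real model.
Registry = Struct.new(:id, :persistent, :retry_count, :last_synced_at, :synced,
                      keyword_init: true)

# Pick the next batch of unsynced registries, sorted by `order_key`
# (ties broken by id), mimicking an ORDER BY ... LIMIT batch_size query.
def next_batch(registries, batch_size, order_key)
  registries
    .reject(&:synced)
    .sort_by { |r| [r.public_send(order_key), r.id] }
    .first(batch_size)
end

# Run `rounds` sync passes and return the registries afterwards.
def simulate(order_key:, rounds:, batch_size:)
  # Assumption for the demo: IDs 1..5 fail forever (persistent failures),
  # IDs 6..10 would succeed on their next attempt (transient failures).
  registries = (1..10).map do |id|
    Registry.new(id: id, persistent: id <= 5, retry_count: 0,
                 last_synced_at: 0, synced: false)
  end

  (1..rounds).each do |tick|
    next_batch(registries, batch_size, order_key).each do |r|
      r.last_synced_at = tick
      if r.persistent
        r.retry_count += 1 # fails again and stays in the retry pool
      else
        r.synced = true    # transient failure recovers
      end
    end
  end

  registries
end

by_id   = simulate(order_key: :id, rounds: 4, batch_size: 5)
by_time = simulate(order_key: :last_synced_at, rounds: 4, batch_size: 5)

puts "order by id:             #{by_id.count(&:synced)} synced, " \
     "max retry_count #{by_id.map(&:retry_count).max}"
puts "order by last_synced_at: #{by_time.count(&:synced)} synced, " \
     "max retry_count #{by_time.map(&:retry_count).max}"
```

With ID ordering, the five persistent failures fill every batch, so none of the transient failures are retried and retry_count only climbs. When the least recently attempted rows go first, the transient failures are picked up in the second pass. The persistent failures are still retried, but they no longer monopolize the batch.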
To do
- Add order() to needs_sync_again or where it's used (find_registries_needs_sync_again) to get a batch
- Generate query plan of impacted queries, for each impacted replicator (package files, MR diffs, terraform state versions, and snippet repositories)
- Add or modify indexes for bad query plans
Edited by Michael Kozono