Geo: Batch count query statement timeout

Problem

In a secondary Geo site with ~40M job artifacts, Geo::MetricsUpdateWorker jobs are erroring with PG::QueryCanceled: ERROR: canceling statement due to statement timeout.

rake geo:status also fails with the same error, so we are flying blind with respect to Geo replication/verification status, but we need this in order to be ready to failover to the secondary site.

The logs shows the failure in Geo::VerifiableReplicator.verification_failed_count in Gitlab::Database::BatchCounter#actual_start, during this SQL query: SELECT MIN("job_artifact_registry"."id") FROM "job_artifact_registry" WHERE ("job_artifact_registry"."state" IN (2)) AND ("job_artifact_registry"."verification_state" IN (3))

This index is the one I expect would be chosen: CREATE INDEX job_artifact_registry_failed_verification ON job_artifact_registry USING btree (verification_retry_at NULLS FIRST) WHERE ((state = 2) AND (verification_state = 3));

We use "batch count" logic.

EXPLAIN ANALYZE of SELECT MIN timed out. Here is EXPLAIN without the ANALYZE:

gitlabhq_geo_production=> EXPLAIN SELECT MIN("job_artifact_registry"."id") FROM "job_artifact_registry" WHERE ("job_artifact_registry"."state" IN (2)) AND ("job_artifact_registry"."verification_state" IN (3));
                                                            QUERY PLAN                                                            
----------------------------------------------------------------------------------------------------------------------------------
 Result  (cost=1.81..1.82 rows=1 width=4)
   InitPlan 1 (returns $0)
     ->  Limit  (cost=0.56..1.81 rows=1 width=4)
           ->  Index Scan using job_artifact_registry_pkey on job_artifact_registry  (cost=0.56..5983372.31 rows=4795875 width=4)
                 Index Cond: (id IS NOT NULL)
                 Filter: ((state = 2) AND (verification_state = 3))
(6 rows)

It looks like the MIN query chose a bad index (job_artifact_registry_pkey). It has to run through every single ID and filter out every one, since almost no records match.

Edited Mar 07, 2024 by Michael Kozono