Geo: Batch count query statement timeout
Problem
In a secondary Geo site with ~40M job artifacts, Geo::MetricsUpdateWorker
jobs are erroring with PG::QueryCanceled: ERROR: canceling statement due to statement timeout
.
rake geo:status
also fails with the same error, so we are flying blind with respect to Geo replication/verification status, but we need this in order to be ready to failover to the secondary site.
The logs shows the failure in Geo::VerifiableReplicator.verification_failed_count
in Gitlab::Database::BatchCounter#actual_start
, during this SQL query: SELECT MIN("job_artifact_registry"."id") FROM "job_artifact_registry" WHERE ("job_artifact_registry"."state" IN (2)) AND ("job_artifact_registry"."verification_state" IN (3))
This index is the one I expect would be chosen: CREATE INDEX job_artifact_registry_failed_verification ON job_artifact_registry USING btree (verification_retry_at NULLS FIRST) WHERE ((state = 2) AND (verification_state = 3));
We use "batch count" logic.
EXPLAIN ANALYZE
of SELECT MIN
timed out. Here is EXPLAIN
without the ANALYZE
:
gitlabhq_geo_production=> EXPLAIN SELECT MIN("job_artifact_registry"."id") FROM "job_artifact_registry" WHERE ("job_artifact_registry"."state" IN (2)) AND ("job_artifact_registry"."verification_state" IN (3));
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
Result (cost=1.81..1.82 rows=1 width=4)
InitPlan 1 (returns $0)
-> Limit (cost=0.56..1.81 rows=1 width=4)
-> Index Scan using job_artifact_registry_pkey on job_artifact_registry (cost=0.56..5983372.31 rows=4795875 width=4)
Index Cond: (id IS NOT NULL)
Filter: ((state = 2) AND (verification_state = 3))
(6 rows)
It looks like the MIN
query chose a bad index (job_artifact_registry_pkey
). It has to run through every single ID and filter out every one, since almost no records match.