C6 Migration Query Timeout Bug Follow Up Work
Prior Issues for Context
- gitlab-com/gl-infra/gitlab-dedicated/team#4159 -- the GitLab Dedicated team's tracking issue for this bug incident
- #442259 (closed) -- the Geo team's prior bug investigation for a related query timeout issue
Problem Context
In preparation for customer C6’s migration from self-hosted to dedicated managed GitLab, dedicated engineers have been running into a query timeout problem when running rake geo:status
, which caused the rake task overall to fail.
Analyzing the queries running on Postgres (in our Geo tracking schema) during the rake task indicates this was the actual query that was timing out (this isn’t an exact copy and paste, but was hand typed by Simon, so be aware this may not be the exact text to match):
select max(job_artifact_registry.id) from job_artifact_registry where state in (3);
Currently, we have the following indices on job_artifact_registry
:
CREATE INDEX index_job_artifact_registry_on_artifact_id ON job_artifact_registry USING btree (artifact_id);
CREATE INDEX index_job_artifact_registry_on_retry_at ON job_artifact_registry USING btree (retry_at);
CREATE INDEX job_artifact_registry_failed_verification ON job_artifact_registry USING btree (verification_retry_at NULLS FIRST) WHERE ((state = 2) AND (verification_state = 3));
CREATE INDEX job_artifact_registry_needs_verification ON job_artifact_registry USING btree (verification_state) WHERE ((state = 2) AND (verification_state = ANY (ARRAY[0, 3])));
CREATE INDEX job_artifact_registry_pending_verification ON job_artifact_registry USING btree (verified_at NULLS FIRST) WHERE ((state = 2) AND (verification_state = 0));
So it does not have any direct index on state
alone and all of our indices to make the pagination queries we do against the job_artifact_registry
have partial conditions on them that preclude them from use in this specific query.
@stomlinson was able to resolve this bug directly by adding a new index for C6's geo database: CREATE INDEX CONCURRENTLY ON job_artifact_registry (state, id)
. However, this isn't in our Geo schema right now so the problem will persist for other similar situations. I am opening this issue to follow up and correct our indices on the job_artifact_registry
table.
Follow Up Tasks
At a minimum, we need to add a simple index on state
to job_artifact_registry
. There is pagination code somewhere in our base that is trying to get min and max values for each possible state value on the table. (@mkozono if you have more specific information where these queries may be kicking off, please add it in the comments).
We may also want to identify where in the code these pagination queries happen and consider whether we should instead create a single compound index to solve all of these problems CREATE INDEX ON job_artifact_registry (state, verification_state, id)
(see aformentioned issue #442259 (closed) for more details on why we currently have these partial indexes job_artifact_registry_failed_verification
, job_artifact_registry_needs_verification
, and job_artifact_registry_pending_verification
. If the single compound index resolves the problem, we should also consider removing these partial indices.
Mike: The historical context is these indexes were not created for these count queries. Instead, they were created to support concurrent verification. E.g. to support https://gitlab.com/gitlab-org/gitlab/-/blob/master/ee/app/workers/geo/verification_batch_worker.rb