JobArtifactState.checksummed_count query timing out
Summary
Geo is facing timeouts due to a query against the `JobArtifactState` table.
To mitigate it, we need to apply one (or both) of the following solutions:
- Create a new index on `(verification_state, job_artifact_id, partition_id)`
- Rewrite the query so that the query planner chooses a more efficient plan

The query in question is this one.
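The first mitigation can be sketched as plain PostgreSQL DDL; the index name below matches the one the Workaround section creates:

```sql
-- Index covering the verification_state filter plus the join columns.
-- CONCURRENTLY avoids taking a lock that would block writes during the build.
CREATE INDEX CONCURRENTLY IF NOT EXISTS index_on_job_artifact_id_partition_id_verification_state
ON ci_job_artifact_states (verification_state, job_artifact_id, partition_id);
```

With `verification_state` leading, the planner can satisfy the `verification_state = 2` filter with an index scan instead of a sequential scan over the whole table.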
Steps to reproduce
- Log in to the production console
- Run the commands described here
To confirm if you are affected
This affects GitLab 16.11, 17.0, 17.1, and 17.2. You can confirm this with:
- The status of a secondary site is no longer updating, e.g. it's more than 1 day old.
  - If the status is never more than 1 day old, then it is getting updated at least once a day. It will say `Unhealthy` if the status is more than 10 minutes old, but the 10 minute threshold is arbitrary. In this situation, you should decide for yourself what threshold for status updates is acceptable. This timing depends on the amount and shape of your data, as well as your infrastructure. The timing should improve upon upgrade to 17.3. If it is acceptable, then you are not affected.
- Get a psql console to your primary site's DB, e.g. run `gitlab-rails dbconsole` on a Puma or Sidekiq node.
- Run:

```sql
EXPLAIN SELECT MAX("p_ci_job_artifacts"."id")
FROM "p_ci_job_artifacts"
INNER JOIN "ci_job_artifact_states"
  ON "ci_job_artifact_states"."partition_id" IS NOT NULL
  AND "ci_job_artifact_states"."job_artifact_id" = "p_ci_job_artifacts"."id"
  AND "ci_job_artifact_states"."partition_id" = "p_ci_job_artifacts"."partition_id"
WHERE "ci_job_artifact_states"."verification_state" = 2;
```
- If the query plan looks like this (it is processing a huge amount of data), then you are affected:
```
gitlabhq_production=> EXPLAIN SELECT MAX("p_ci_job_artifacts"."id") FROM "p_ci_job_artifacts" INNER JOIN "ci_job_artifact_states" ON "ci_job_artifact_states"."partition_id" IS NOT NULL AND "ci_job_artifact_states"."job_artifact_id" = "p_ci_job_artifacts"."id" AND "ci_job_artifact_states"."partition_id" = "p_ci_job_artifacts"."partition_id" WHERE "ci_job_artifact_states"."verification_state" = 2;
                                                                           QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Finalize Aggregate  (cost=3437228.45..3437228.46 rows=1 width=8)
   ->  Gather  (cost=3437228.23..3437228.44 rows=2 width=8)
         Workers Planned: 2
         ->  Partial Aggregate  (cost=3436228.23..3436228.24 rows=1 width=8)
               ->  Parallel Hash Join  (cost=1228572.96..3380654.60 rows=22229453 width=8)
                     Hash Cond: ((p_ci_job_artifacts.id = ci_job_artifact_states.job_artifact_id) AND (p_ci_job_artifacts.partition_id = ci_job_artifact_states.partition_id))
                     ->  Parallel Append  (cost=0.00..1709621.36 rows=22237958 width=16)
                           ->  Parallel Seq Scan on ci_job_artifacts p_ci_job_artifacts_1  (cost=0.00..1598431.57 rows=22237957 width=16)
                           ->  Parallel Seq Scan on ci_job_artifacts_102 p_ci_job_artifacts_2  (cost=0.00..0.00 rows=1 width=16)
                     ->  Parallel Hash  (cost=786588.17..786588.17 rows=22229453 width=16)
                           ->  Parallel Seq Scan on ci_job_artifact_states  (cost=0.00..786588.17 rows=22229453 width=16)
                                 Filter: ((partition_id IS NOT NULL) AND (verification_state = 2))
(12 rows)
```
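To gauge how much data those sequential scans touch, you can check the size of both tables from the same psql session using standard PostgreSQL catalog functions (table names taken from the plan above):

```sql
-- Approximate row count (from planner statistics) and on-disk size of each table.
SELECT relname,
       reltuples::bigint AS approx_rows,
       pg_size_pretty(pg_total_relation_size(oid)) AS total_size
FROM pg_class
WHERE relname IN ('ci_job_artifacts', 'ci_job_artifact_states');
```

Tables in the tens of millions of rows, as in the plan above (`rows=22229453`), make an unindexed filter on `verification_state` expensive enough to hit statement timeouts.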
Workaround
- If you are on 17.3 or higher, then this PG index is already created and you're not affected.
- If you're on 16.11 - 17.2:
  - Get a psql console to your primary site's DB, e.g. run `gitlab-rails dbconsole` on a Puma or Sidekiq node.
  - Disable statement timeouts for this session. Run:

    ```sql
    SET statement_timeout = 0;
    ```

  - Create the index which is already in 17.3. Run:

    ```sql
    CREATE INDEX CONCURRENTLY IF NOT EXISTS index_on_job_artifact_id_partition_id_verification_state
    ON ci_job_artifact_states (verification_state, job_artifact_id, partition_id);
    ```
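After the `CREATE INDEX CONCURRENTLY` statement finishes, you can confirm the index exists from the same psql session by querying the standard `pg_indexes` view:

```sql
-- The index should appear here once the concurrent build completes.
SELECT indexname, indexdef
FROM pg_indexes
WHERE tablename = 'ci_job_artifact_states'
  AND indexname = 'index_on_job_artifact_id_partition_id_verification_state';
```

Note that if a concurrent build is interrupted, PostgreSQL can leave behind an invalid index; in that case drop it and run the `CREATE INDEX CONCURRENTLY` statement again.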