JobArtifactState.checksummed_count query timing out
Summary
Geo is facing timeouts due to a query against the `JobArtifactState` table.
To mitigate it, we need to apply one (or both) of the following solutions:
- Create a new index on `(verification_state, job_artifact_id, partition_id)`
- Rewrite the query so that the query planner chooses a more efficient plan

The query in question is this one.
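The first mitigation can be sketched as plain PostgreSQL DDL; the index name below matches the one the Workaround section creates:

```sql
-- Index covering the verification_state filter plus the join columns.
-- CONCURRENTLY avoids taking a lock that would block writes during the build.
CREATE INDEX CONCURRENTLY IF NOT EXISTS index_on_job_artifact_id_partition_id_verification_state
ON ci_job_artifact_states (verification_state, job_artifact_id, partition_id);
```

With `verification_state` leading, the planner can satisfy the `verification_state = 2` filter with an index scan instead of a sequential scan over the whole table.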
Steps to reproduce
- Log in to the production console
- Run the commands described here
To confirm if you are affected
This affects GitLab 16.11, 17.0, 17.1, and 17.2. You can confirm this with:
- The status of a secondary site is no longer updating, e.g. it's more than 1 day old.
  - If the status is never more than 1 day old, then it is getting updated at least once a day. It will say `Unhealthy` if the status is more than 10 minutes old, but the 10 minute threshold is arbitrary. In this situation, you should decide for yourself what threshold for status updates is acceptable. This timing depends on the amount and shape of your data, as well as your infrastructure. The timing should improve upon upgrade to 17.3. If it is acceptable, then you are not affected.
- Get a psql console to your primary site's DB, e.g. run `gitlab-rails dbconsole` on a Puma or Sidekiq node.
- Run:

```sql
EXPLAIN SELECT MAX("p_ci_job_artifacts"."id")
FROM "p_ci_job_artifacts"
INNER JOIN "ci_job_artifact_states"
  ON "ci_job_artifact_states"."partition_id" IS NOT NULL
  AND "ci_job_artifact_states"."job_artifact_id" = "p_ci_job_artifacts"."id"
  AND "ci_job_artifact_states"."partition_id" = "p_ci_job_artifacts"."partition_id"
WHERE "ci_job_artifact_states"."verification_state" = 2;
```
- If the query plan looks like this (it is processing a huge amount of data), then you are affected:
```
gitlabhq_production=> EXPLAIN SELECT MAX("p_ci_job_artifacts"."id") FROM "p_ci_job_artifacts" INNER JOIN "ci_job_artifact_states" ON "ci_job_artifact_states"."partition_id" IS NOT NULL AND "ci_job_artifact_states"."job_artifact_id" = "p_ci_job_artifacts"."id" AND "ci_job_artifact_states"."partition_id" = "p_ci_job_artifacts"."partition_id" WHERE "ci_job_artifact_states"."verification_state" = 2;
                                                                           QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Finalize Aggregate  (cost=3437228.45..3437228.46 rows=1 width=8)
   ->  Gather  (cost=3437228.23..3437228.44 rows=2 width=8)
         Workers Planned: 2
         ->  Partial Aggregate  (cost=3436228.23..3436228.24 rows=1 width=8)
               ->  Parallel Hash Join  (cost=1228572.96..3380654.60 rows=22229453 width=8)
                     Hash Cond: ((p_ci_job_artifacts.id = ci_job_artifact_states.job_artifact_id) AND (p_ci_job_artifacts.partition_id = ci_job_artifact_states.partition_id))
                     ->  Parallel Append  (cost=0.00..1709621.36 rows=22237958 width=16)
                           ->  Parallel Seq Scan on ci_job_artifacts p_ci_job_artifacts_1  (cost=0.00..1598431.57 rows=22237957 width=16)
                           ->  Parallel Seq Scan on ci_job_artifacts_102 p_ci_job_artifacts_2  (cost=0.00..0.00 rows=1 width=16)
                     ->  Parallel Hash  (cost=786588.17..786588.17 rows=22229453 width=16)
                           ->  Parallel Seq Scan on ci_job_artifact_states  (cost=0.00..786588.17 rows=22229453 width=16)
                                 Filter: ((partition_id IS NOT NULL) AND (verification_state = 2))
(12 rows)
```
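To gauge how much data those sequential scans touch, you can check the size of both tables from the same psql session using standard PostgreSQL catalog functions (table names taken from the plan above):

```sql
-- Approximate row count (from planner statistics) and on-disk size of each table.
SELECT relname,
       reltuples::bigint AS approx_rows,
       pg_size_pretty(pg_total_relation_size(oid)) AS total_size
FROM pg_class
WHERE relname IN ('ci_job_artifacts', 'ci_job_artifact_states');
```

Tables in the tens of millions of rows, as in the plan above (`rows=22229453`), make an unindexed filter on `verification_state` expensive enough to hit statement timeouts.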
Workaround
- If you are on 17.3 or higher, then this PG index is already created and you're not affected.
- If you're on 16.11 - 17.2:
  - Get a psql console to your primary site's DB, e.g. run `gitlab-rails dbconsole` on a Puma or Sidekiq node.
  - Disable statement timeouts for this session. Run:

    ```sql
    SET statement_timeout = 0;
    ```

  - Create the index which is already in 17.3. Run:

    ```sql
    CREATE INDEX CONCURRENTLY IF NOT EXISTS index_on_job_artifact_id_partition_id_verification_state
    ON ci_job_artifact_states (verification_state, job_artifact_id, partition_id);
    ```
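After the `CREATE INDEX CONCURRENTLY` statement finishes, you can confirm the index exists from the same psql session by querying the standard `pg_indexes` view:

```sql
-- The index should appear here once the concurrent build completes.
SELECT indexname, indexdef
FROM pg_indexes
WHERE tablename = 'ci_job_artifact_states'
  AND indexname = 'index_on_job_artifact_id_partition_id_verification_state';
```

Note that if a concurrent build is interrupted, PostgreSQL can leave behind an invalid index; in that case drop it and run the `CREATE INDEX CONCURRENTLY` statement again.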