Queries time out for expired artifacts with unknown locked status
Problem with the current index
Given the currently non-functional performance of the job_id query to the ci_job_artifacts
table, I believe we're in need of a new index.
The current plan uses an index scan
Limit (cost=0.58..39.50 rows=100 width=16)
-> Index Scan using index_ci_job_artifacts_on_expire_at_and_job_id on ci_job_artifacts (cost=0.58..71493289.79 rows=183662867 width=16)
Index Cond: (expire_at < '2022-05-04 17:28:55.453263+00'::timestamp with time zone)
Filter: (locked = 2)
which is
CREATE INDEX index_ci_job_artifacts_on_expire_at_and_job_id ON ci_job_artifacts USING btree (expire_at, job_id);
that originally worked well for us since old artifacts tended to be locked = 2
. But a few days after we started locking artifacts (that is, not simply clearing through the oldest artifacts in this index which is ordered by expire_at
) performance became very poor. I think this is because this index has the entirety of the table in it, and we're attempting to filter for locked = 2
which this index is not good at.
Proposal for a new index
I suggest we add a new partial index tailored to this operation. While it is a highly specific index, I would recommend introducing it explicitly as a temporary index that will be removed when the backlog operation has finished and we remove the worker. The worker will live until 16.0, and the index would be removed at that point if not earlier. I propose we add:
CREATE INDEX tmp_index_job_artifacts_with_unknown_locked_status ON ci_job_artifacts USING btree (expire_at, job_id) WHERE locked = 2 AND expire_at IS NOT NULL;
This accomplishes two main things:
- Bringing artifacts with unknown locked status close to each other in an index to decrease the duration of index scans
- Excludes artifacts without
expire_at
values, since those artifacts are job traces that we don't want to process.
We've considered creating an index of locked = 2
artifacts before, but decided against it because it would take a great deal of time to build as it would have included every row in the table. However, now with the job traces no longer having an expiration date we can use the two conditions together to build a partial index containing less than 200m rows.
Bonus things:
- We can drop the non-temporary index
index_ci_job_artifacts_on_expire_at_and_job_id
earlier than expected, since I believe this background process is the only workflow that sends queries to it. For good measure, we would build the new index and not drop this old one until we see it remaining completely idle for a significant period of time.