
Lower autovacuuming settings for ci_job_artifacts table

Summary

As described in gitlab-org/gitlab#346427 (closed), we experienced significant downtime due to the query planner on the replicas taking a long time to sift through dead tuples in the database:

  1. Many old rows in ci_job_artifacts were deleted, but since there are over 1.4 billion rows in that table, the autovacuum threshold (7+ million dead tuples) was not reached. As a result, there were a lot of dead tuples with low job_id values (a query to check the current dead-tuple count is sketched after this list).
  2. These dead tuples still need to be scanned when the query planner calculates min(job_id).
  3. When planning a min(job_id) query, the primary opportunistically marks dead tuples with an LP_DEAD hint bit. This significantly reduces future buffer reads, but these hint bits are ignored by replicas at the moment. There is a PostgreSQL patch in review that should fix this, but the earliest it could land is PostgreSQL 15.
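
As a rough way to see how many dead tuples have accumulated on this table and when autovacuum last ran, something like the following could be checked against the standard pg_stat_user_tables view (a sketch for illustration, not a query taken from the incident):

SELECT n_live_tup, n_dead_tup, last_autovacuum, autovacuum_count
FROM pg_stat_user_tables
WHERE relname = 'ci_job_artifacts';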

Related Incident(s)

production#5952 (closed)

Originating issue(s): production#5952 (closed)

Desired Outcome/Acceptance criteria

We discovered that the autovacuuming settings for the ci_job_artifacts table need to be lowered so vacuuming runs more frequently, keeping dead tuples from accumulating and slowing down index scans. We have about 1.4 billion rows today:

gitlabhq_production=# SELECT relpages, reltuples FROM pg_class WHERE relname = 'ci_job_artifacts';
 relpages |  reltuples
----------+--------------
 31676161 | 1.417529e+09
(1 row)

Since autovacuum_vacuum_scale_factor is 0.005 on GitLab.com, we need about 7 million dead tuples to trigger vacuuming on this table. That's a lot of dead tuples for index scans to step over, and without the killed-tuple hint bits, a replica has to look through a lot of data whenever it calculates min(job_id) on ci_job_artifacts during query planning. Since the replicas get hit hard with lots of these ci_job_artifacts queries, a large backlog of dead tuples makes for a perfect storm of buffer mapping contention.
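
For reference, the dead-tuple count at which autovacuum kicks in can be estimated from the table statistics and the current cluster settings (a sketch; it ignores any per-table storage-parameter overrides that may already be in place):

SELECT reltuples,
       current_setting('autovacuum_vacuum_threshold')::bigint
         + current_setting('autovacuum_vacuum_scale_factor')::float8 * reltuples
         AS dead_tuples_to_trigger_vacuum
FROM pg_class
WHERE relname = 'ci_job_artifacts';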

Given all the other autovacuuming activity, I'm wondering what scale factor would be reasonable here. We know the incident occurred when we deleted a lot of old rows, so what is the largest number of dead heap pages we can tolerate on the replicas? If, for example, we target a threshold of 1,000,000 dead tuples, we might consider setting the scale factor to 0.0007 (0.0007 × ~1.4 billion rows ≈ 1 million).
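
If we settle on a value, a per-table override (rather than changing the cluster-wide default) would look roughly like this; 0.0007 here is only the example value from above, not a decided setting:

ALTER TABLE ci_job_artifacts SET (
  autovacuum_vacuum_scale_factor = 0.0007
);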

Associated Services

  • PostgreSQL

Corrective Action Issue Checklist

  • link the incident(s) this corrective action arose out of
  • give context for what problem this corrective action is trying to prevent from re-occurring
  • assign a severity label (this is the highest sev of related incidents, defaults to 'severity::4')
  • assign a priority (this will default to 'priority::4')