# Lower autovacuuming settings for `ci_job_artifacts` table
## Summary
As described in gitlab-org/gitlab#346427 (closed), we experienced significant downtime because the query planner on the replicas took a long time to sift through dead tuples in the database:

- Many old rows in `ci_job_artifacts` were deleted, but since there are over 1.4 billion rows in that table, the autovacuum threshold (7+ million rows) was not reached. As a result, there were a lot of dead tuples with low `job_id` values.
- These old rows need to be scanned when the query planner calculates `min(job_id)`.
- When planning a `min(job_id)` query on a primary, the primary opportunistically marks dead tuples with a `LP_DEAD` hint bit. This significantly reduces future buffer reads, but these hint bits are currently ignored by replicas. There is a PostgreSQL patch in review that should fix this, but the earliest it could land is PostgreSQL 15.
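A quick way to see how close a table is to its autovacuum trigger is the cumulative statistics view (a diagnostic sketch using the standard `pg_stat_user_tables` view; the exact numbers on GitLab.com will of course differ):

```sql
-- Dead-tuple count and most recent (auto)vacuum runs for the affected table
SELECT n_live_tup, n_dead_tup, last_vacuum, last_autovacuum
FROM pg_stat_user_tables
WHERE relname = 'ci_job_artifacts';
```

Comparing `n_dead_tup` against the computed trigger threshold shows how far the table can drift between autovacuum runs.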
## Related Incident(s)
Originating issue(s): production#5952 (closed)
## Desired Outcome/Acceptance criteria
We discovered that the autovacuuming settings for the `ci_job_artifacts` table need to be lowered so vacuuming runs more frequently, avoiding dead tuples in index scans. We have about 1.4 billion rows today:
```sql
gitlabhq_production=# SELECT relpages, reltuples FROM pg_class WHERE relname = 'ci_job_artifacts';
 relpages |  reltuples
----------+--------------
 31676161 | 1.417529e+09
(1 row)
```
Since `autovacuum_analyze_scale_factor` is 0.005 on GitLab.com, we need about 7 million dead tuples to trigger vacuuming. That's a lot of dead tuples for index scans, and without the killed-tuple hint bits, the replica has to look through a lot of data whenever it calculates `min(job_id)` on `ci_job_artifacts` during query planning. Since the replicas get hit hard with many of these `ci_job_artifacts` queries, this creates a perfect storm of buffer mapping contention when there are a lot of dead tuples.
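The ~7 million figure follows from the standard autovacuum trigger formula (sketched below; the 50-row `autovacuum_vacuum_threshold` base is the PostgreSQL default and is an assumption here, since GitLab.com may override it):

```sql
-- Dead tuples needed to trigger autovacuum:
--   autovacuum_vacuum_threshold + scale_factor * reltuples
SELECT 50 + 0.005 * 1.417529e9 AS dead_tuples_to_trigger;
-- ≈ 7.09 million
```

Because the threshold scales with `reltuples`, a table this large accumulates millions of dead tuples before autovacuum ever fires.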
Given all the other autovacuuming, I'm wondering what scale factor would be reasonable here. We know the incident occurred when we deleted a lot of old rows, so what is the largest number of dead heap pages we can tolerate on the replica? If, for example, we target a threshold of 1,000,000 dead tuples, we might consider setting the scale factor to 0.0007.
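One way to apply such an override per table rather than cluster-wide is a storage parameter on the relation itself (a sketch only; 0.0007 is the illustrative target from above, not a decided value):

```sql
-- Per-table autovacuum tuning: 0.0007 * ~1.4e9 rows ≈ 1 million dead tuples
ALTER TABLE ci_job_artifacts SET (autovacuum_vacuum_scale_factor = 0.0007);

-- Verify the stored reloption
SELECT reloptions FROM pg_class WHERE relname = 'ci_job_artifacts';
```

A per-table setting keeps the cluster-wide 0.005 default intact for the many smaller tables where it is already adequate.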
## Associated Services
- PostgreSQL
## Corrective Action Issue Checklist

- [ ] link the incident(s) this corrective action arose out of
- [ ] give context for what problem this corrective action is trying to prevent from re-occurring
- [ ] assign a severity label (this is the highest sev of related incidents, defaults to 'severity::4')
- [ ] assign a priority (this will default to 'priority::4')