
Lower autovacuuming settings for ci_job_artifacts table

Summary

As described in gitlab-org/gitlab#346427 (closed), we experienced significant downtime due to the query planner on the replicas taking a long time to sift through dead tuples in the database:

  1. Many old rows in ci_job_artifacts were deleted, but since there are over 1.4 billion rows in that table, the autovacuum threshold (7+ million dead tuples) was not reached. As a result, there were a lot of dead tuples with low job_id values (a query to check the current dead-tuple count is sketched after this list).
  2. These dead tuples still need to be scanned when the query planner calculates min(job_id).
  3. When planning a min(job_id) query, the primary opportunistically marks dead tuples with an LP_DEAD hint bit. This significantly reduces future buffer reads, but these hint bits are ignored by replicas at the moment. There is a PostgreSQL patch in review that should fix this, but the earliest it could land is PostgreSQL 15.
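
As a rough way to see how many dead tuples have accumulated on this table and when autovacuum last ran, something like the following could be checked against the standard pg_stat_user_tables view (a sketch for illustration, not a query taken from the incident):

SELECT n_live_tup, n_dead_tup, last_autovacuum, autovacuum_count
FROM pg_stat_user_tables
WHERE relname = 'ci_job_artifacts';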

Related Incident(s)

production#5952 (closed)

Originating issue(s): production#5952 (closed)

Desired Outcome/Acceptance criteria

We discovered that the autovacuuming settings for the ci_job_artifacts table need to be lowered so vacuuming runs more frequently, keeping dead tuples from accumulating and slowing down index scans. We have about 1.4 billion rows today:

gitlabhq_production=# SELECT relpages, reltuples FROM pg_class WHERE relname = 'ci_job_artifacts';
 relpages |  reltuples
----------+--------------
 31676161 | 1.417529e+09
(1 row)

Since autovacuum_vacuum_scale_factor is 0.005 on GitLab.com, we need about 7 million dead tuples to trigger vacuuming on this table. That's a lot of dead tuples for index scans to step over, and without the killed-tuple hint bits, a replica has to look through a lot of data whenever it calculates min(job_id) on ci_job_artifacts during query planning. Since the replicas get hit hard with lots of these ci_job_artifacts queries, a large backlog of dead tuples makes for a perfect storm of buffer mapping contention.
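
For reference, the dead-tuple count at which autovacuum kicks in can be estimated from the table statistics and the current cluster settings (a sketch; it ignores any per-table storage-parameter overrides that may already be in place):

SELECT reltuples,
       current_setting('autovacuum_vacuum_threshold')::bigint
         + current_setting('autovacuum_vacuum_scale_factor')::float8 * reltuples
         AS dead_tuples_to_trigger_vacuum
FROM pg_class
WHERE relname = 'ci_job_artifacts';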

Given all the other autovacuuming activity, I'm wondering what scale factor would be reasonable here. We know the incident occurred when we deleted a lot of old rows, so what is the largest number of dead heap pages we can tolerate on the replicas? If, for example, we target a threshold of 1,000,000 dead tuples, we might consider setting the scale factor to 0.0007 (0.0007 × ~1.4 billion rows ≈ 1 million).
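
If we settle on a value, a per-table override (rather than changing the cluster-wide default) would look roughly like this; 0.0007 here is only the example value from above, not a decided setting:

ALTER TABLE ci_job_artifacts SET (
  autovacuum_vacuum_scale_factor = 0.0007
);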

Associated Services

  • PostgreSQL

Corrective Action Issue Checklist

  • link the incident(s) this corrective action arose out of
  • give context for what problem this corrective action is trying to prevent from re-occurring
  • assign a severity label (this is the highest sev of related incidents, defaults to 'severity::4')
  • assign a priority (this will default to 'priority::4')