# Change fillfactor for `ci_builds` table
## Change Summary
In order to increase the likelihood of an updated tuple landing on the same page, this change decreases the `fillfactor` table setting for `ci_builds`.
The background and reasoning can be found in gitlab-org/gitlab#330507 (closed).
If this shows the desired outcome, it will greatly help to avoid delays in consuming the shared runner CI queue (e.g. #4486 (closed)); we have about 40 similar incidents.
## Implications/risk
The expected benefit is that index-only scans speed up considerably (for "the big CI query"). This is because we expect an updated tuple to be more likely to land on the same page as the original one, and hence we don't need to scan as many pages as we would if updates were scattered all over different pages.
A potential negative effect is that new pages will have 20% of their space reserved for updates. If there were no updates at all, this would cause a 20% overhead in terms of disk space and also for shared buffers. However, we have established that there is a significant frequency of updates for a new record in `ci_builds`, so effectively this is not a concern.
Changing the `fillfactor` only affects new pages going forward; old pages remain unchanged. This is desirable since we don't expect to update existing old build records much, if at all.
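As a side note, the current update frequency and the share of HOT (heap-only tuple) updates can be read from PostgreSQL's standard statistics view; a minimal sketch, not part of the change steps:

```sql
-- How often rows in ci_builds are updated, and how many of those
-- updates were HOT (the new tuple version stayed on the same page).
SELECT relname,
       n_tup_upd                                              AS total_updates,
       n_tup_hot_upd                                          AS hot_updates,
       round(100.0 * n_tup_hot_upd / NULLIF(n_tup_upd, 0), 2) AS hot_update_pct
FROM pg_stat_user_tables
WHERE relname = 'ci_builds';
```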
The idea is to change the setting and observe the impact for 1-2 working days (the effect most likely won't be visible on weekends) before we make a call on whether or not this will be a permanent setting.
A rollback will be possible by changing the `fillfactor` back to 100.
We should roll back if we ultimately don't see the desired positive effect on "the big CI query" or elsewhere. However, we may want to try other `fillfactor` settings (going down even further than 80) before we ultimately roll back.
## Change Details
- **Services Impacted** - postgres
- **Change Technician** - @abrandl, @msmiley
- **Change Criticality** - C2
- **Change Type** - change::unscheduled
- **Change Reviewer** - @msmiley
- **Due Date** - 2021-05-13
- **Time tracking** - 2 minutes to apply, same to roll back. More time will be spent evaluating the effectiveness of the tuning adjustment.
- **Downtime Component** - None expected
## Detailed steps for the change
### Pre-Change Steps - steps to be completed before execution of the change

**Estimated Time to Complete (mins)** - Estimated Time to Complete in Minutes

- [ ] Set label ~"change::in-progress" on this issue
### Change Steps - steps to take to execute the change

**Estimated Time to Complete (mins)** - 2

- [ ] Run this on the PG primary:

  ```sql
  alter table ci_builds set (fillfactor = 80);
  ```
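Note that changing `fillfactor` via `ALTER TABLE ... SET` acquires a `SHARE UPDATE EXCLUSIVE` lock, which doesn't block normal reads and writes but does queue behind conflicting lock holders (e.g. a running `VACUUM`). One defensive pattern is to bound the wait; a sketch only, where the timeout value is illustrative and not part of the plan:

```sql
-- Bound how long the ALTER may wait for its lock, so it cannot queue
-- indefinitely behind a long-running conflicting operation.
SET lock_timeout = '5s';
ALTER TABLE ci_builds SET (fillfactor = 80);
RESET lock_timeout;
```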
### Post-Change Steps - steps to take to verify the change

**Estimated Time to Complete (mins)** - Estimated Time to Complete in Minutes

- [ ] Run `\d+ ci_builds` and check that `fillfactor=80` shows up under `Options`
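Equivalently, for a client without psql meta-commands, the setting can be read from the system catalog; a small sketch using the standard `pg_class` lookup:

```sql
-- fillfactor is stored in the table's reloptions array;
-- after the change this should show {fillfactor=80}.
SELECT relname, reloptions
FROM pg_class
WHERE relname = 'ci_builds';
```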
## Rollback

### Rollback steps - steps to be taken in the event of a need to roll back this change

**Estimated Time to Complete (mins)** - 2

- [ ] Run this on the PG primary:

  ```sql
  alter table ci_builds set (fillfactor = 100);
  ```
## Monitoring

### Key metrics to observe
- Metric: Rate of HOT updates on the table
  - Location: Thanos
  - We expect an increase here; the higher the rate, the better the change is working.
- Metric: Query plan monitoring similar to #4486 (comment 569642848), tracking `Heap Fetches` for the CI queuing query (see the sketch below this list for where this appears in `EXPLAIN` output)
  - This should go down over time.
- Metric: CI queuing latency - https://dashboards.gitlab.net/d/ci-runners-incident-database/ci-runners-incident-support-database?viewPanel=23&orgId=1&refresh=1m&from=now-6h&to=now
  - This should become more stable or go down.
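For reference, `Heap Fetches` is the counter reported by `EXPLAIN (ANALYZE)` on an index-only scan node. A minimal illustration of where to look; the query below is a stand-in (not the actual CI queuing query from the linked comment), and `ci_builds_pkey` assumes the default primary-key index name:

```sql
-- A stand-in query that can be satisfied by an index-only scan on the PK.
EXPLAIN (ANALYZE, BUFFERS)
SELECT id FROM ci_builds WHERE id < 1000;
-- In the output, look for a node like:
--   Index Only Scan using ci_builds_pkey on ci_builds ...
--     Heap Fetches: 123
-- If the change works as intended, this number should trend down over time.
```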
## Changes checklist
- [ ] This issue has a criticality label (e.g. C1, C2, C3, C4) and a change-type label (e.g. change::unscheduled, change::scheduled) based on the Change Management Criticalities.
- [ ] This issue has the change technician as the assignee.
- [ ] Pre-Change, Change, Post-Change, and Rollback steps have been filled out and reviewed.
- [ ] Necessary approvals have been completed based on the Change Management Workflow.
- [ ] Change has been tested in staging and results noted in a comment on this issue.
- [ ] A dry-run has been conducted and results noted in a comment on this issue.
- [ ] SRE on-call has been informed prior to change being rolled out. (In #production channel, mention @sre-oncall and this issue and await their acknowledgement.)
- [ ] There are currently no active incidents.