# Change fillfactor for `ci_builds` table
## Change Summary
In order to increase the likelihood of an updated tuple landing on the same page, this change decreases the `fillfactor` table setting for `ci_builds`.
The background and reasoning can be found in gitlab-org/gitlab#330507 (closed).
If this shows the desired outcome, it will greatly help to avoid delays in consuming the shared runner CI queue (e.g. #4486 (closed)); we have about 40 similar incidents.
## Implications/risk
The expected benefit is that index-only scans speed up considerably (for "the big CI query"). This is because we expect an updated tuple to be more likely to land on the same page as the original one, and hence we don't need to scan as many pages as we would if updates were scattered all over different pages.
A potential negative effect is that new pages will have 20% of their space reserved for updates. If there were no updates at all, this would cause a 20% overhead in terms of disk space and also for shared buffers. However, we have established that there is a significant frequency of updates for a new record in `ci_builds`, so effectively this is not a concern.
Changing the `fillfactor` only affects new pages going forward; old pages remain unchanged. This is desirable since we don't expect to update existing old build records much, if at all.
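As a side note, the current update frequency and the share of HOT (heap-only tuple) updates can be read from PostgreSQL's standard statistics view; a minimal sketch, not part of the change steps:

```sql
-- How often rows in ci_builds are updated, and how many of those
-- updates were HOT (the new tuple version stayed on the same page).
SELECT relname,
       n_tup_upd                                              AS total_updates,
       n_tup_hot_upd                                          AS hot_updates,
       round(100.0 * n_tup_hot_upd / NULLIF(n_tup_upd, 0), 2) AS hot_update_pct
FROM pg_stat_user_tables
WHERE relname = 'ci_builds';
```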
The idea is to change the setting and observe the impact for 1-2 working days (the effect most likely won't be visible on weekends) before we make a call on whether or not this will be a permanent setting.
A rollback will be possible by changing the `fillfactor` back to 100.
We should roll back if we ultimately don't see the desired positive effect on "the big CI query" or elsewhere. However, we may want to try other `fillfactor` settings (going down even further than 80) before we ultimately roll back.
## Change Details
- **Services Impacted** - postgres
- **Change Technician** - @abrandl, @msmiley
- **Change Criticality** - C2
- **Change Type** - change::unscheduled
- **Change Reviewer** - @msmiley
- **Due Date** - 2021-05-13
- **Time tracking** - 2 minutes to apply, same to roll back. More time will be spent evaluating the effectiveness of the tuning adjustment.
- **Downtime Component** - None expected
## Detailed steps for the change
### Pre-Change Steps - steps to be completed before execution of the change

**Estimated Time to Complete (mins)** - Estimated Time to Complete in Minutes

- [ ] Set label ~"change::in-progress" on this issue
### Change Steps - steps to take to execute the change

**Estimated Time to Complete (mins)** - 2

- [ ] Run this on the PG primary:

  ```sql
  alter table ci_builds set (fillfactor = 80);
  ```
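Note that changing `fillfactor` via `ALTER TABLE ... SET` acquires a `SHARE UPDATE EXCLUSIVE` lock, which doesn't block normal reads and writes but does queue behind conflicting lock holders (e.g. a running `VACUUM`). One defensive pattern is to bound the wait; a sketch only, where the timeout value is illustrative and not part of the plan:

```sql
-- Bound how long the ALTER may wait for its lock, so it cannot queue
-- indefinitely behind a long-running conflicting operation.
SET lock_timeout = '5s';
ALTER TABLE ci_builds SET (fillfactor = 80);
RESET lock_timeout;
```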
### Post-Change Steps - steps to take to verify the change

**Estimated Time to Complete (mins)** - Estimated Time to Complete in Minutes

- [ ] Run `\d+ ci_builds` and check that `fillfactor=80` shows up under `Options`
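Equivalently, for a client without psql meta-commands, the setting can be read from the system catalog; a small sketch using the standard `pg_class` lookup:

```sql
-- fillfactor is stored in the table's reloptions array;
-- after the change this should show {fillfactor=80}.
SELECT relname, reloptions
FROM pg_class
WHERE relname = 'ci_builds';
```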
## Rollback

### Rollback steps - steps to be taken in the event of a need to roll back this change

**Estimated Time to Complete (mins)** - 2

- [ ] Run this on the PG primary:

  ```sql
  alter table ci_builds set (fillfactor = 100);
  ```
## Monitoring

### Key metrics to observe
- Metric: Rate of HOT updates on the table
  - Location: Thanos
  - We expect an increase here; the higher the rate, the better the change is working.
- Metric: Query plan monitoring similar to #4486 (comment 569642848), tracking `Heap Fetches` for the CI queuing query (see the sketch below this list for where this appears in `EXPLAIN` output)
  - This should go down over time.
- Metric: CI queuing latency - https://dashboards.gitlab.net/d/ci-runners-incident-database/ci-runners-incident-support-database?viewPanel=23&orgId=1&refresh=1m&from=now-6h&to=now
  - This should become more stable or go down.
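For reference, `Heap Fetches` is the counter reported by `EXPLAIN (ANALYZE)` on an index-only scan node. A minimal illustration of where to look; the query below is a stand-in (not the actual CI queuing query from the linked comment), and `ci_builds_pkey` assumes the default primary-key index name:

```sql
-- A stand-in query that can be satisfied by an index-only scan on the PK.
EXPLAIN (ANALYZE, BUFFERS)
SELECT id FROM ci_builds WHERE id < 1000;
-- In the output, look for a node like:
--   Index Only Scan using ci_builds_pkey on ci_builds ...
--     Heap Fetches: 123
-- If the change works as intended, this number should trend down over time.
```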
## Changes checklist
- [ ] This issue has a criticality label (e.g. C1, C2, C3, C4) and a change-type label (e.g. change::unscheduled, change::scheduled) based on the Change Management Criticalities.
- [ ] This issue has the change technician as the assignee.
- [ ] Pre-Change, Change, Post-Change, and Rollback steps have been filled out and reviewed.
- [ ] Necessary approvals have been completed based on the Change Management Workflow.
- [ ] Change has been tested in staging and results noted in a comment on this issue.
- [ ] A dry-run has been conducted and results noted in a comment on this issue.
- [ ] SRE on-call has been informed prior to change being rolled out. (In #production channel, mention @sre-oncall and this issue and await their acknowledgement.)
- [ ] There are currently no active incidents.