Increase likelihood of HOT updates on ci_builds
This is a follow-up to the discussion in gitlab-com/gl-infra/production#4486 (comment 570520539).
We've had numerous occasions of CI queuing queries slowing down in recent weeks. The query timings follow a sawtooth pattern.
We have tried to improve the situation with a targeted index to facilitate an index-only scan: #327142 (closed). The index is being picked up and we do perform an index-only scan. However, there is an unexpectedly large number of heap fetches we need to do in order to look up visibility information. We suspect this is due to a high number of updates to a new build record.
In more detail, the number of pages fetched per tuple returned varies from 30 to more than 65.
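The heap-fetch overhead can be observed with `EXPLAIN (ANALYZE, BUFFERS)`. A minimal sketch against a live database follows; the predicate here is only illustrative (the real queuing query is considerably more complex), but the plan output to look at is the same:

```sql
-- Sketch: observe heap fetches during an index-only scan on ci_builds.
-- The WHERE clause is an illustrative assumption, not the actual queuing query.
EXPLAIN (ANALYZE, BUFFERS)
SELECT id
FROM ci_builds
WHERE status = 'pending';

-- In the plan output, look for a node like:
--   Index Only Scan using ... on ci_builds
--     Heap Fetches: <n>
-- A high "Heap Fetches" count relative to rows returned means the visibility
-- map could not answer the visibility check, so heap pages had to be read --
-- exactly the overhead described above.
```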
The reason for this is believed to be a high number of updates to new build records. There are two actions we should consider in order to improve this:
- Decrease `fillfactor` for `ci_builds`
- Remove the index on `ci_builds.updated_at` if we can
## Reasoning for `fillfactor`
A lower `fillfactor` on the table increases our chances of finding enough space for a new tuple version to go into the same page. We basically reserve some space on the page for updates. This in turn decreases the likelihood of an updated tuple version landing in an entirely different page - hence reducing the number of pages we need to scan for visibility information during the index scan.
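Changing `fillfactor` is a single storage-parameter change; a minimal sketch (80% matches the value we settled on, as noted in the status below):

```sql
-- Sketch: reserve 20% of each heap page for future tuple versions.
ALTER TABLE ci_builds SET (fillfactor = 80);

-- Note: this only affects pages written from now on. Existing pages keep
-- their current density until they are rewritten (e.g. by VACUUM FULL or
-- pg_repack), so the effect on a large existing table appears gradually.
```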
## Reasoning for index removal
HOT updates have the same effect: the updated tuple version goes into the same page (leveraging HOT), instead of a different page we would then have to scan for visibility.
There are two prerequisites for HOT:
- Enough space in the page (another reason to decrease fillfactor)
- The UPDATE does not modify an indexed column.
The latter is the reason to remove the index on `updated_at`. This column is maintained automatically by Rails whenever the record changes. That is, we should expect to see a changed `updated_at` with every update. Since this column is indexed, this effectively prevents HOT from doing its thing.
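Whether HOT is actually kicking in can be measured from `pg_stat_user_tables`, and the index removal itself is a single concurrent drop. A sketch; the index name below is an assumption (look it up first, e.g. via `pg_indexes`), since it isn't given above:

```sql
-- Sketch: what fraction of updates to ci_builds are HOT?
SELECT n_tup_upd,
       n_tup_hot_upd,
       round(100.0 * n_tup_hot_upd / NULLIF(n_tup_upd, 0), 1) AS hot_pct
FROM pg_stat_user_tables
WHERE relname = 'ci_builds';

-- Sketch: remove the updated_at index without blocking writes.
-- "index_ci_builds_on_updated_at" is an assumed name for illustration.
DROP INDEX CONCURRENTLY IF EXISTS index_ci_builds_on_updated_at;
```

If the index removal and `fillfactor` change work as intended, `hot_pct` should rise noticeably afterwards.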
## Status
We have tuned the fillfactor on the ci_builds table from 100% (default) to 80% (reserves 20% of space in each block for future updates).
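That the storage parameter actually took effect can be verified from `pg_class`; a small sketch:

```sql
-- Sketch: confirm the fillfactor storage parameter on ci_builds.
SELECT reloptions
FROM pg_class
WHERE relname = 'ci_builds';
-- reloptions should now include {fillfactor=80}; NULL would mean the
-- default (100) is still in effect.
```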
Unfortunately, this did not improve the duration of "The Big Query". Yesterday we realized that because rows in this table are both wide and go through several mutations before being vacuumed, it may not be possible to get a high percentage of HOT updates for the normal application workload (not counting the one-time background migration). Running the experiment today (2021-05-13) seems to confirm that conclusion. The current hypothesis is that this table's rows are too wide and are updated too often for a meaningful percentage of row versions to fit in the space reserved by `fillfactor`.
With what we know now, #329764 (closed) seems like our best option.
As of 2021-05-11, this was our high-level status summary: gitlab-com/gl-infra/production#4486 (comment 573068599)
As of 2021-05-13, this summarized the results of the fillfactor experiment:
- Data and graphs: gitlab-com/gl-infra/production#4585 (comment 575320959)
- Interpretations: #330507 (comment 575309623) and gitlab-com/gl-infra/production#4585 (comment 575273624)

