Increase likelihood of HOT updates on ci_builds

This is a follow-up to the discussion in gitlab-com/gl-infra/production#4486 (comment 570520539).

We've had numerous occasions of CI queuing queries slow down in the recent weeks. The query timings look like a saw tooth pattern:

image__1_

We have tried to improve the situation with a targeted index to facilitate an index-only scan: #327142 (closed) The index is being picked up and we do performan an index-only scan. However, there's an unexpected large amount of heap fetches we need to do in order to lookup visibility information. We suspect this is due to a high number of updates to a new build record.

In more detail, the number of pages fetched per tuple returned varies from 30 to more than 65:

shared_rows_ratio

The reason for this is believed to be a high number of updates to new build records. There is two actions we should consider in order to improve this:

  1. Decrease fillfactor for ci_builds
  2. Remove index on ci_builds.updated_at if we can

Reasoning for fillfactor

A lower fillfactor on the table increases our chances to find enough space for a new tuple version to go into the same page. We basically reserve some space on the page for updates. This in turn can decrease the likelihood of an updated tuple version to go into an entirely different page - hence returning the amount of pages we need to scan for visibility information during the index scan.

Reasoning for index removal

HOT updates have the same effect: The updated tuple version goes into the same page (leveraging HOT), instead of a different page we have to scan for visibility.

There are two pre-requisites for HOT:

  1. Enough space in the page (another reason to decrease fillfactor)
  2. The UPDATE does not modify an indexed column.

The latter is the reason to remove the index on updated_at. This column is being maintained by Rails automatically, whenever the record changes. That is, we should expect to see a changed updated_at with any update. Since this column is indexed, this effectively prevents HOT from doing its thing.

Status

We have tuned the fillfactor on the ci_builds table from 100% (default) to 80% (reserves 20% of space in each block for future updates).

Unfortunately this did not improve the duration of "The Big Query". Yesterday we realized that the combination of rows in this table being both wide and going through several mutations before being vacuumed means that it may not be possible to get a high percentage of HOT updates for the normal application workload (not counting the one-time background migration). Running the experiment today (2021-05-13) seems to confirm that conclusion. The current hypothesis is that this table's rows are too wide and are updated too often for a meaningful percentage of row-versions to fit in the space reserved by fillfactor.

With what we know now, #329764 (closed) seems like our best option.

As of 2021-05-11, this was our high-level status summary: gitlab-com/gl-infra/production#4486 (comment 573068599)

As of 2021-05-13, this summarized the results of the fillfactor experiment:

Edited by Matt Smiley