Consider reactively throttling large data changes when degrading index scan performance

Summary

The severity-1 incident gitlab-com/gl-infra/production#5952 (closed) was caused by a frequently run index scan becoming significantly less efficient.

This was induced by a Sidekiq job performing background data changes on historical data in an order that correlates with a frequently scanned index.

The specific details of this data change allow for a workaround -- changing the order of the data changes so they no longer correlate with this particular index's order. But to help prevent future incidents with similar bulk data changes, we ideally want a more generic protection.

This issue aims to consider providing a mechanism for detecting whether index scan efficiency is being degraded in this way, and pausing the data change until the indexes are back in a healthy state.

More specifically, if a large number of dead index items accumulate at the start or end of an indexed column's range of live values, this adds overhead to the index scan, measurable as a large value for the "Heap Fetches" counter in the EXPLAIN ANALYZE output. This overhead only occurs when querying a standby db (i.e. a replica db, not the primary db or a dblabs clone).

Index scans other than searching for a column's min/max value can also be affected by this pathology, but the min/max case is definitely an important and common one. So we are starting with just that.

Brief intro to the pathology

The triggering condition that made this index scan inefficient was a relatively slow background data migration that happened to be choosing which rows to delete by walking the table's most frequently used index. This caused a dense accumulation of dead index items at the low end of the first indexed column's range of values.

Meanwhile, on the replica dbs, many concurrent readers were using that index. Each index scan that traversed that range of index items pointing to dead row-versions had to independently perform row-visibility checks for each visited dead item. As the incident progressed, there were as many as 600K of these dead index items that had to be traversed before the index scan found a single live row. That overhead gets repeated during every index scan until the next vacuum cleans up these dead tuples and their index items.

This already significant overhead per index scan was compounded by the fact that many of these index scans were running concurrently. Performing the row visibility check requires pinning the shared buffers containing the relevant parts of the table's visibility map, and that frequent pinning/unpinning results in spinlock and lwlock contention. This is where most of the CPU time was being spent.

Collectively this index scan inefficiency starved other queries for resources (mainly CPU time and the relevant lwlocks). This led to overall slowness and saturation for the database connection pool, which emerged as the usual cascade of user-facing slowness and errors.

Preventing the index inefficiency from building to a critical point would break this causal chain. That approach is what we want to explore in this issue.

References

There are lots of insightful discussion threads in #346427 (closed) and #346940 (closed) and gitlab-com/gl-infra/production#5952 (closed). The following short list focuses on summaries of the findings, as a primer.

  • gitlab-com/gl-infra/production#5952 (comment 748517574) - Concise walk-through of the incident pathology. Recommended to start here.
  • gitlab-com/gl-infra/production#5952 (comment 751163463) - This is a direct follow-up to the above walk-through.
  • gitlab-com/gl-infra/production#5952 (comment 743067011) - Earlier walk-through, focused on the details of why the planner itself was running an index scan.
  • #346940 (comment 752257442) - Illustration that the overall number of dead tuples is a poor indicator of the risk of this regression. Instead, we need to detect dense accumulations of dead tuples.
  • #346427 (comment 746784163) - Marking index tuples as dead only occurs on the writable primary db, not on read-only replicas.

Impact

During the incident:

The database service apdex dropped below SLO for several hours:

Dashboard link

Screenshot_from_2021-12-07_13-24-51

That performance regression put enough backpressure on the Rails clients that the db request rate dropped well below its norm:

Dashboard link

Screenshot_from_2021-12-07_13-42-12

Query duration and replication lag show the same pattern.

Crucially, the batches of deletes run by ExpireBuildArtifactWorker also correlate with the regression, as documented in more detail in this thread.

All of the streaming replica dbs had a large increase in userspace CPU time:

Screenshot_2021-11-22_at_15.09.39

A whole-host CPU profile of an example replica db showed over 70% of postgres CPU time was spent in the query planner (not the query executor). Profiles like this were instrumental in post-incident analysis discussions such as here and here.

Screenshot_from_2021-11-26_21-16-00

Recommendation

To limit the accumulation of overhead for index scans, we can make the data change job start by checking whether any of the table's indexes have to traverse many dead tuples to return the min/max value of their leading column. If that heap fetch count exceeds a configured threshold (e.g. 10K heap fetches), the data change should defer its current batch of work. Vacuum will eventually clean up those dead tuples, at which point the pre-flight check will start to pass again.

This check must be run on a replica db, not the primary db. This is crucial, because the primary db will almost never exhibit this pathology, but the replica dbs are all susceptible to it.
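A quick way to assert that precondition from the data change's own database session is pg_is_in_recovery(), which returns true on a standby and false on the writable primary. A minimal sketch:

-- Sanity check before running the probe: confirm this connection is a
-- streaming replica, since the primary will rarely show the symptom.
select pg_is_in_recovery();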

Concretely, we can start by checking just the min and max values of the first column of each full btree index on the table targeted by the data change. Running a query like the following returns a Heap Fetches count for each end of the index. If either count is large (e.g. 10K), that is the warning sign to pause the data change until vacuum cleans up the dead tuple references that have accumulated at that end of the index.

explain ( analyze ) select min(job_id), max(job_id) from ci_job_artifacts ;

                                                                                                           QUERY PLAN                                                                                                           
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Result  (cost=1.21..1.22 rows=1 width=16) (actual time=0.047..0.048 rows=1 loops=1)
   InitPlan 1 (returns $0)
     ->  Limit  (cost=0.58..0.60 rows=1 width=8) (actual time=0.020..0.020 rows=1 loops=1)
           ->  Index Only Scan using index_ci_job_artifacts_on_job_id_and_file_type on ci_job_artifacts  (cost=0.58..38998942.87 rows=1452802176 width=8) (actual time=0.019..0.019 rows=1 loops=1)
                 Index Cond: (job_id IS NOT NULL)
                 Heap Fetches: 0
   InitPlan 2 (returns $1)
     ->  Limit  (cost=0.58..0.60 rows=1 width=8) (actual time=0.025..0.025 rows=1 loops=1)
           ->  Index Only Scan Backward using index_ci_job_artifacts_on_job_id_and_file_type on ci_job_artifacts ci_job_artifacts_1  (cost=0.58..38998942.87 rows=1452802176 width=8) (actual time=0.025..0.025 rows=1 loops=1)
                 Index Cond: (job_id IS NOT NULL)
                 Heap Fetches: 1
 Planning Time: 0.148 ms
 Execution Time: 0.061 ms
(13 rows)

Note: For easier parsing, we can make that EXPLAIN ( ANALYZE, FORMAT JSON ) ..., but for illustration purposes, the default text output format is easier to read.
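For illustration, here is a minimal sketch of that pre-flight check as a plpgsql helper using the JSON output format. It assumes PostgreSQL 12+ (for jsonb_path_query), and the function name and signature are hypothetical rather than an existing helper. It runs the min/max probe under EXPLAIN ( ANALYZE, FORMAT JSON ) and totals the "Heap Fetches" counters found in the plan tree:

-- Hypothetical helper: total "Heap Fetches" needed to answer a min/max probe
-- on the given column. Run this on a replica db, not the primary.
create or replace function index_minmax_heap_fetches(p_table regclass, p_column text)
returns bigint
language plpgsql
as $$
declare
  plan  json;
  total bigint;
begin
  -- Capture the plan of the min/max probe in JSON format.
  execute format(
    'explain ( analyze, format json ) select min(%1$I), max(%1$I) from %2$s',
    p_column, p_table)
  into plan;

  -- Walk the plan tree and sum every "Heap Fetches" counter
  -- (one per index-only scan node).
  select coalesce(sum((node ->> 'Heap Fetches')::bigint), 0)
    into total
    from jsonb_path_query(
           plan::jsonb,
           '$.** ? (@.type() == "object" && exists (@."Heap Fetches"))') as node;

  return total;
end;
$$;

-- Example: defer the current batch of the data change while the probe
-- exceeds the configured threshold (10K here).
select index_minmax_heap_fetches('ci_job_artifacts', 'job_id') > 10000 as should_defer;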

The attached notes demonstrate how to find the list of columns to be checked:

2021-12-07_find_1st_column_of_each_btree_index_on_a_table.txt
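For reference, a catalog query along these lines can produce that column list (a sketch only; the attached notes remain the authoritative version). It returns the first key column of each full btree index on the target table, skipping partial and expression indexes:

-- First key column of each full btree index on ci_job_artifacts.
select
    ic.relname as index_name,
    a.attname  as first_column
from pg_index i
join pg_class ic on ic.oid = i.indexrelid
join pg_class t  on t.oid  = i.indrelid
join pg_am am    on am.oid = ic.relam
join pg_attribute a on a.attrelid = t.oid and a.attnum = i.indkey[0]
where t.relname = 'ci_job_artifacts'
  and am.amname = 'btree'
  and i.indpred is null   -- exclude partial indexes
  and i.indkey[0] <> 0    -- exclude indexes whose first key is an expression
order by 1;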

Verification

The pathology we want to avoid is index scans on the target table having to traverse many dead tuples before finding their intended live tuples.

If that starts to happen, we want the data change to pause.

The mechanism proposed above attempts to directly measure that overhead.

The following Prometheus query complements that direct measurement by looking at the overall workload for each index on the target table. If many dead tuples accumulate in any frequently scanned range of the index, this will show up as a sharp increase in the mean number of heap fetches per index scan.

PromQL:

sum by (relname, indexrelname, fqdn) (rate(pg_stat_user_indexes_idx_tup_read{env="gprd", relname="ci_job_artifacts"}[1m]) and on (env, fqdn) (pg_replication_is_replica == 1))
/
( sum by (relname, indexrelname, fqdn) (rate(pg_stat_user_indexes_idx_scan{env="gprd", relname="ci_job_artifacts"}[1m]) and on (env, fqdn) (pg_replication_is_replica == 1)) > 1)

Thanos query

Screenshot_from_2021-12-08_07-24-09

For context, each index's normal ratio of heap fetches per scan depends on the workload characteristics, so there is not a fixed value for that ratio that is "good" or "bad". Instead, a sharp and sustained increase of that ratio is highly suggestive of this pathology -- dead tuples accumulated in a frequently scanned portion of the index (possibly at the min or max value, or possibly elsewhere), and vacuum has not yet cleaned them up.

So, if the increase in that ratio correlates with the data change and does not disappear after a couple of minutes, that is likely to indicate a regression.

Pausing the data change (via the proposed mechanism or any other mechanism) should stop the increase in that ratio, but only vacuum will reduce that ratio back to where it started.
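To tell whether vacuum has since done that cleanup, the target table's vacuum timestamps can be checked in pg_stat_user_tables on the primary (vacuum only runs there; its cleanup then replicates to the standbys). A minimal sketch, keeping in mind that the overall dead-tuple estimate alone is a poor indicator of this specific risk (see the reference list above):

-- When did vacuum/autovacuum last process the target table, and roughly how
-- many dead tuples remain table-wide?
select relname, last_vacuum, last_autovacuum, n_dead_tup
from pg_stat_user_tables
where relname = 'ci_job_artifacts';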

Note: A change in workload or query plan for a frequently run query can potentially also cause a volatile change in the ratio of heap fetches per index scan, but that is unlikely to be both sustained over time and correlated with the data change.

For more details on the above example, see: gitlab-com/gl-infra/production#5952 (comment 757151499)
