2021-02-01: Monthly CI budget reset triggers series of db performance regressions

Summary

The monthly scheduled job for resetting the budget of CI minutes available to each namespace is known to cause intermittent spikes in db response time. This incident documents this month's occurrence of this behavior.

The current implementation of the BatchResetService updates numerous rows in the database, spreading this work across a 24-hour timespan in batches. Each batch has a chance of triggering the regression.
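
To make the batching concrete, here is a minimal sketch of one way to spread an ID range across a 24-hour window. It is not the BatchResetService code: the function name, its parameters, and the even spacing of batches are all assumptions made for illustration.

```python
from datetime import timedelta
from typing import List, Tuple


def schedule_batches(
    min_id: int,
    max_id: int,
    batch_size: int,
    window: timedelta = timedelta(hours=24),
) -> List[Tuple[int, int, timedelta]]:
    """Split [min_id, max_id] into batches and space them evenly over `window`.

    Hypothetical helper: returns (first_id, last_id, delay_from_start) tuples.
    """
    starts = range(min_id, max_id + 1, batch_size)
    if len(starts) == 0:
        return []
    step = window / len(starts)  # even spacing is an assumption
    return [
        (start, min(start + batch_size - 1, max_id), index * step)
        for index, start in enumerate(starts)
    ]
```

Each returned tuple is one bulk update, and so one opportunity for the regression described above.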

The regression itself is believed to be caused by poor execution plans: any query that involves one of the 3 tables being bulk-updated is at greater risk of getting one. Among those tables, the namespaces table is the most relevant, because many queries include it.

Our short-term mitigation is to refresh the optimizer statistics for these 3 tables when we see response time spiking. Several better long-term options exist, including:

Tune this table to be prioritized by autovacuum and to let it run more aggressively (i.e. little or no cost-delay).
Make the application code explicitly analyze the affected tables immediately before committing their transaction. This minimizes the window during which stale statistics are present. (In this case, the table-level statistics indicating block count and row count are probably the relevant ones, but we get fresh column-level statistics too.)
Change the model for tracking available vs. spent CI minutes, so it stops requiring updates to nearly every row in a frequently accessed table.
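
The statistics refresh and the first two options can be sketched as SQL issued from Python. This is an illustration under assumptions, not what was run during the incident: the helper names are hypothetical, the parameter values are untuned placeholders, and only the namespaces table is named because the source does not list the other two. The storage parameters themselves (autovacuum_vacuum_cost_delay, autovacuum_analyze_scale_factor) are standard PostgreSQL per-table settings. The cursor is assumed to be any DB-API style cursor on an autocommit connection.

```python
from typing import Iterable, List, Mapping

# Real PostgreSQL per-table storage parameters; the values are
# illustrative assumptions, not tuned recommendations.
AGGRESSIVE_AUTOVACUUM = {
    "autovacuum_vacuum_cost_delay": 0,        # no cost-based throttling
    "autovacuum_analyze_scale_factor": 0.01,  # analyze after ~1% of rows change
}


def quote_ident(name: str) -> str:
    """Quote a table name as a PostgreSQL identifier."""
    return '"' + name.replace('"', '""') + '"'


def analyze_statements(tables: Iterable[str]) -> List[str]:
    """Short-term mitigation: refresh optimizer statistics per table."""
    return [f"ANALYZE {quote_ident(table)}" for table in tables]


def autovacuum_statement(
    table: str, params: Mapping[str, float] = AGGRESSIVE_AUTOVACUUM
) -> str:
    """Option 1: let autovacuum treat one table more aggressively."""
    settings = ", ".join(f"{key} = {value}" for key, value in params.items())
    return f"ALTER TABLE {quote_ident(table)} SET ({settings})"


def bulk_update_then_analyze(cursor, update_sql: str, tables: Iterable[str]) -> None:
    """Option 2: analyze the affected tables just before committing.

    ANALYZE is allowed inside a transaction block in PostgreSQL, so the
    fresh statistics become visible together with the updated rows.
    """
    cursor.execute("BEGIN")
    try:
        cursor.execute(update_sql)
        for statement in analyze_statements(tables):
            cursor.execute(statement)
        cursor.execute("COMMIT")
    except Exception:
        cursor.execute("ROLLBACK")
        raise
```

For example, autovacuum_statement("namespaces") builds a single ALTER TABLE statement that could be reviewed before being applied by hand.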

Timeline

All times UTC.

2021-02-01

01:20 - @msmiley declares incident in Slack.

Corrective Actions

Enabled a feature flag that tracks CI minutes on a monthly basis with an automatic/lazy reset: https://gitlab.com/gitlab-org/gitlab/-/issues/300803#note_517151424
Reduce DB load when resetting CI minute notifications gitlab-org/gitlab#323069 (closed)
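
As a rough illustration of what an automatic/lazy monthly reset can look like, the sketch below tags each usage counter with the month it belongs to and resets it only when the namespace is next touched. This is not the GitLab implementation: the NamespaceUsage record, its field names, and both helper functions are hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import Tuple


@dataclass
class NamespaceUsage:
    """Hypothetical per-namespace usage record; field names are illustrative."""
    minutes_used: int = 0
    period: Tuple[int, int] = (1970, 1)  # (year, month) the counter belongs to


def current_period(today: date) -> Tuple[int, int]:
    return (today.year, today.month)


def minutes_used(usage: NamespaceUsage, today: date) -> int:
    """Read path: a counter left over from an earlier month counts as zero."""
    if usage.period == current_period(today):
        return usage.minutes_used
    return 0


def record_minutes(usage: NamespaceUsage, minutes: int, today: date) -> None:
    """Write path: reset lazily the first time a namespace is touched in a new month."""
    period = current_period(today)
    if usage.period != period:
        usage.minutes_used = 0
        usage.period = period
    usage.minutes_used += minutes
```

With a scheme like this, no job has to rewrite every row at the start of the month; only namespaces that actually run CI get updated, one row at a time.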

Incident Review

Summary

Service(s) affected:
Team attribution:
Time to detection:
Minutes downtime or degradation:

Metrics

Customer Impact

Who was impacted by this incident? (i.e. external customers, internal customers)
1. ...
What was the customer experience during the incident? (i.e. preventing them from doing X, incorrect display of Y, ...)
1. ...
How many customers were affected?
1. ...
If a precise customer impact number is unknown, what is the estimated impact (number and ratio of failed requests, amount of traffic drop, ...)?
1. ...

What were the root causes?

"5 Whys"

Incident Response Analysis

How was the incident detected?
1. ...
How could detection time be improved?
1. ...
How was the root cause diagnosed?
1. ...
How could time to diagnosis be improved?
1. ...
How did we reach the point where we knew how to mitigate the impact?
1. ...
How could time to mitigation be improved?
1. ...
What went well?
1. ...

Post Incident Analysis

Did we have other events in the past with the same root cause?
1. ...
Do we have existing backlog items that would've prevented or greatly reduced the impact of this incident?
1. ...
Was this incident triggered by a change (deployment of code or change to infrastructure)? If yes, link the issue.
1. ...

Lessons Learned

Guidelines

Blameless RCA Guideline

Resources

If the Situation Zoom room was utilised, the recording will be automatically uploaded to the Incident room Google Drive folder (private).

Incident Review Stakeholders

Edited Mar 05, 2021 by John Jarvis