Reduce the number of DB updates during the CI Minutes monthly reset
Problem to Solve
The incident from gitlab-com/gl-infra/production#3464 (closed) has affected our SLA on gitlab.com, as the intermittent database spikes have degraded response times.
Several related incidents occurred late last year, which CI attempted to mitigate by spreading the workers over a longer period of time.
| Incidents | Attempted Resolutions by CI |
|---|---|
| gitlab-com/gl-infra/production#3268 (closed) (Jan 1 2021) | Spread workers over 24h from 8h |
| gitlab-com/gl-infra/production#3101 (closed) (Dec 1 2020) | Spread workers over 8h from 3h |
| gitlab-com/gl-infra/production#2950 (closed) (Nov 1 2020) | |
| gitlab-com/gl-infra/production#2779 (closed) (Oct 1 2020) | Spread workers over 3h |
| gitlab-com/gl-infra/production#2605 (comment 405985845) (Sep 1 2020) | |
Technical consideration - Questions to answer
- A bug fix in %13.3 introduced additional database updates to `namespaces`: all projects in a given namespace had their CI minutes reset, instead of only the ones with non-zero usage, which was the previous logic. The number of updates was also noted as a potential problem in !38057 (comment 387876432). Product has agreed that we could revert this bug fix, as the MR was intended to resolve a severity3 UX improvement. We may also explore other ways of updating the UI that do not require a DB update of this volume, if that would help mitigate this incident. Note that this bug fix MR was deployed as part of %13.3, which has an Aug 17 2020 release date, so the increase in DB updates is in line with the incidents we have been noticing since Sep 1st, 2020.
- The workers are currently configured to update over a 24h period, which ended up exacerbating the problem for the full 24-hour period. Should we consider reverting to 8 hours or 3 hours, as previously set? Or even to the original implementation (e.g. reverting !46927 (merged))?
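The difference between the two reset strategies in the first point can be sketched as follows. This is a minimal Python illustration under stated assumptions, not GitLab's actual code: `ProjectUsage`, `shared_runners_seconds`, and `ids_to_reset` are hypothetical names chosen for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ProjectUsage:
    # Hypothetical record: one row of CI minutes usage per project.
    project_id: int
    shared_runners_seconds: int


def ids_to_reset(rows: List[ProjectUsage], only_nonzero: bool = True) -> List[int]:
    """Return the project IDs whose usage row would be updated by the reset.

    only_nonzero=True models the earlier logic (skip rows already at zero);
    only_nonzero=False models resetting every project in the namespace.
    """
    return [
        row.project_id
        for row in rows
        if not only_nonzero or row.shared_runners_seconds > 0
    ]


rows = [
    ProjectUsage(1, 0),
    ProjectUsage(2, 120),
    ProjectUsage(3, 0),
    ProjectUsage(4, 45),
]

print(ids_to_reset(rows, only_nonzero=True))   # [2, 4]
print(ids_to_reset(rows, only_nonzero=False))  # [1, 2, 3, 4]
```

In this toy data set, skipping rows that are already at zero halves the number of updates; the real ratio depends on how many projects actually consumed CI minutes during the month.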
Implementation
Per the analysis in #300979 (comment 501614700):
- Revert !38057 (merged)
- Go back to using a 3-hour spread window for the jobs: gitlab-com/gl-infra/production#3464 (comment 499899705)
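
To illustrate what the spread window controls, here is a rough Python sketch of distributing reset batches evenly across a window. It is an assumption-laden illustration rather than the actual worker scheduling code: `batch_delays` and the batch count of 4 are hypothetical.

```python
from typing import List


def batch_delays(num_batches: int, window_seconds: int) -> List[int]:
    """Return the start delay (in seconds) of each batch, spaced evenly
    across the window. Integer arithmetic avoids float rounding."""
    if num_batches <= 0:
        return []
    return [(i * window_seconds) // num_batches for i in range(num_batches)]


THREE_HOURS = 3 * 60 * 60
TWENTY_FOUR_HOURS = 24 * 60 * 60

print(batch_delays(4, THREE_HOURS))        # [0, 2700, 5400, 8100]
print(batch_delays(4, TWENTY_FOUR_HOURS))  # [0, 21600, 43200, 64800]
```

The total number of updates is the same in both cases. A shorter window concentrates the same load into less time, while a longer one lowers the per-batch rate but keeps the database under reset load for longer.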