Implement throttling mechanism for large data changes
*This page may contain information related to upcoming products, features and functionality. It is important to note that the information presented is for informational purposes only, so please do not rely on the information for purchasing or planning purposes. Just like with all projects, the items mentioned on the page are subject to change or delay, and the development, release, and timing of any products, features, or functionality remain at the sole discretion of GitLab Inc.* ### Problem During the past year, we have had a few database related incidents in GitLab.com that were directly impacted by the rate of updates on specific tables or in which the rate of updates amplified the problem. A great example of such a case is the ~"severity::1" incident gitlab-com/gl-infra/production#5952, which was caused by a frequently run index scan becoming significantly less efficient. This was induced by a sidekiq job performing background data changes on historical data in an order that correlated with a frequently scanned index. After investigating that incident (gitlab-org/gitlab#346427, gitlab-org/gitlab#346940), in addition to the [short term corrective actions](https://gitlab.com/gitlab-org/gitlab/-/issues/346940#corrective-actions) proposed, a longer term approach was identified: reactively throttle large data changes when degrading index scan performance is observed (gitlab-org&7403). Similar throttling approaches could be potentially useful for other database related incidents as well, in which providing some time to PostgreSQL's internal maintenance mechanisms (e.g. `autovacuum`) to kick in would have helped significantly to mitigate the problem. As an example, https://gitlab.com/gitlab-com/gl-infra/production/-/issues/4486#note_573068599 would have benefited from the related updates automatically pausing once the problem was identified and a manual `vacuum` being executed. ### Approach We want to generalize the approach described above and implement a generic throttling mechanism for large data changes that will monitor the health of the Database for various signals (leading indicators) and react to problems by throttling or even pausing the execution of the updates. We think that the best place to implement this mechanism is in the auto-tuning layer of the Batched Background migrations: 1. Most large data changes in GitLab are executed through background migrations. 1. We plan to replace all existing background migrations with batched background migrations (gitlab-org&6751). This is the most relevant place in our code to introduce the throttling mechanism and will address most cases. 1. We already support an auto-tuning mechanism in batched background migrations that allows us to throttle the rate of updates. - We fully control the execution of batched background migrations through our custom scheduler. - We currently take into account the time to execute a batch. - We have the built-in mechanisms to support throttling or even pausing a background migration. Adding support for additional metrics and actions will be way easier than implementing a throttling mechanism outside of the batched background migrations framework. We could add such a mechanism as a stand alone library to be used in any migration or large data operation (e.g. in a cron job), but doing so outside of batched background migrations would require additional work on rescheduling regular migrations, dealing with sidekiq queues and/or postponing other operations. 1. In the long term, we also plan to use batched background migrations to support all other scheduled large data operations that are currently implemented through scheduled cron jobs. Once this is supported, we will be able to automatically cover almost all types of large data operations. ### Plan _(The following is still a work in progress and will change following the discussion that we are going to have)_ Our plan is to build a mechanism that for each background migration actively running, it will be able to monitor : 1. Records updated per time for the relevant tables 1. Dead tuple statistics for the relevant tables 1. Monitor for a large number of "Heap Fetches" for some sample queries relative to the expected one 1. Check for active `autovacuum` processes (is it running right now for a relevant table?) 1. Take into account `autovacuum` settings (how often do we expect a clean up?) Based on the above, we can update our auto-tuning layer to: - Throttle the update rate of data related operations based on records updated per time (basic) - Take into account overall dead tuple statistics for the table(s) updated and monitor in real time for other critical queries, heap fetches and other variable parameters that can adversely affect the performance of the operation and the Database server in general. - Enable additional maintenance decisions based on the previous parameters, even manually vacuum the table if required, while taking into account the autovacuum settings for that table and - Monitor active autovacuum processes and automatically pause execution of background jobs while autovacuum is in progress (potentially important for large tables where autovacuum can take a while and we'd like to unblock it)
epic