Identify database-related metrics that can signal issues during large data operations
We want to implement a throttling mechanism for large data changes (&7594).
Our first task is to use our knowledge of background migrations and review past incidents to identify:
- what we want to measure
- which of those measures have metrics that can be easily obtained in real time
- how we can use those metrics to throttle background migrations
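As a minimal sketch of the third point, the following shows how a single real-time metric (replication lag is used here purely as an example) could be mapped to a throttling signal for the migration runner. The class and threshold are hypothetical, not part of the actual GitLab codebase:

```ruby
# Hypothetical sketch: map a cheaply-obtained real-time metric
# (e.g. WAL replication lag in seconds) to a throttling signal that
# the batched background migration runner could act on.
class LagIndicator
  STOP_THRESHOLD = 60 # assumed value: seconds of lag before we back off

  def initialize(fetch_lag)
    @fetch_lag = fetch_lag # callable returning the current lag in seconds
  end

  # :stop when the metric crosses the threshold, :normal otherwise.
  def signal
    @fetch_lag.call >= STOP_THRESHOLD ? :stop : :normal
  end
end

healthy = LagIndicator.new(-> { 5 })
lagging = LagIndicator.new(-> { 120 })
healthy.signal # => :normal
lagging.signal # => :stop
```

The runner would consult such an indicator before scheduling each batch, pausing (or shrinking the batch size) whenever any indicator reports `:stop`.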
We should also decide how we can gather more expensive metrics that cannot be calculated in real time, and whether we would need to update the auto-tuning layer of batched background migrations to do so. For example, this is most probably the case for the metrics required for &7403 (closed).
As a rough summary of the latter from a live sync: we may need to add an additional process to gather those metrics, which would either act as a service for the auto-tuning layer or store them somewhere accessible to it (a table, for example). We would also have to implement mechanisms for the auto-tuning layer to signal that it wants statistics gathered for specific tables, and to request that gathering stop once the related migration is completed.
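The register/unregister mechanism described above could be sketched as follows. In practice the requests would likely live in a database table shared between the auto-tuning layer and the collector process; this in-memory registry, with hypothetical names throughout, only illustrates the interface:

```ruby
# Hypothetical sketch of the stats-request mechanism: the auto-tuning
# layer registers tables it wants statistics for, a separate collector
# process reads the list of pending tables, and the request is removed
# once the related migration completes.
class StatsRequestRegistry
  def initialize
    @requests = {} # table_name => requesting migration id
  end

  # Auto-tuning layer: start gathering stats for a table.
  def register(table_name, migration_id)
    @requests[table_name] = migration_id
  end

  # Auto-tuning layer: the migration completed, stop gathering.
  def unregister(table_name)
    @requests.delete(table_name)
  end

  # Collector process: which tables currently need stats?
  def tables_to_sample
    @requests.keys
  end
end

registry = StatsRequestRegistry.new
registry.register('ci_builds', 42)
registry.tables_to_sample # => ["ci_builds"]
registry.unregister('ci_builds')
registry.tables_to_sample # => []
```

Backing this with a table rather than a service call would let the collector run fully decoupled from the migration runner, at the cost of slightly stale request state.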