Dynamically determine mirror update interval based on total number of mirrors, average update time, and available concurrency
Replaces https://gitlab.com/gitlab-org/gitlab-ee/issues/3885.
If we allow the user to configure a maximum concurrency for mirror updates (which needs to be less than the actual Sidekiq concurrency), we can determine the ideal update interval based on the total number of mirrors and their average update time, like this:
```ruby
concurrency = 5 * 7 # GitLab.com has 5 sidekiq-pullmirror nodes, each with 7 threads reserved for mirror updates
recent_mirrors = Project.mirror.joins(:mirror_data).with_import_status('finished').where("projects.mirror_last_update_at > ?", Time.now - 24.hours)
mirrors_count = recent_mirrors.count
mirror_update_time = recent_mirrors.average('EXTRACT(epoch FROM projects.mirror_last_update_at - project_mirror_data.last_update_started_at)').to_f
total_update_time = mirrors_count * mirror_update_time
concurrent_update_time = total_update_time / concurrency
```
`concurrent_update_time` now holds the time it takes to update every mirror once.
After updating a mirror, we can calculate the next time it can safely update like so:
```ruby
self.next_execution_timestamp = Time.now + concurrent_update_time.seconds * rand(1.0..1.2)
```
The addition of `* rand(1.0..1.2)` ensures that updates are spread out a little bit, and that there is some room for a sudden jump in number of mirrors.
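To make the effect of the jitter concrete, here is a small sketch (plain Ruby, using the ~1374-second base interval computed above as an assumed input) showing the window into which rescheduled mirrors fall:

```ruby
# Sketch: spread of next_execution_timestamp with the proposed jitter.
# The base interval of 1374 seconds is taken from the GitLab.com
# calculation above; everything else here is for illustration only.
concurrent_update_time = 1374.0 # seconds

now = Time.now
next_timestamps = Array.new(1000) do
  now + concurrent_update_time * rand(1.0..1.2)
end

offsets = next_timestamps.map { |t| t - now }
# Every mirror is rescheduled between 1.0x and 1.2x the base interval,
# i.e. from just under 23 minutes to roughly 27.5 minutes out.
puts offsets.min / 60
puts offsets.max / 60
```

Because `rand(1.0..1.2)` is uniform, the load stays evenly spread over that window instead of bunching every mirror at the same instant.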
On GitLab.com, these calculations work out to an ideal interval of just under 23 minutes:
```
> concurrency = 5 * 7 # GitLab.com has 5 sidekiq-pullmirror nodes, each with 7 threads reserved for mirror updates
=> 35
> recent_mirrors = Project.mirror.joins(:mirror_data).with_import_status('finished').where("projects.mirror_last_update_at > ?", Time.now - 24.hours); nil
=> nil
> mirrors_count = recent_mirrors.count
=> 19654
> mirror_update_time = recent_mirrors.average('EXTRACT(epoch FROM projects.mirror_last_update_at - project_mirror_data.last_update_started_at)').to_f
=> 2.44642611533372
> total_update_time = mirrors_count * mirror_update_time
=> 48082.05887076894
> concurrent_update_time = total_update_time / concurrency
=> 1373.7731105933983
> concurrent_update_time_minutes = concurrent_update_time / 60
=> 22.896218509889973
```
Looking at the actual mirror update data on GitLab.com, we see the same number: 99.8% of the projects last updated in the last 24 hours were updated in the last 23 minutes:
```
> Project.mirror.joins(:mirror_data).with_import_status('finished').where("projects.mirror_last_update_at > ?", Time.now - 23.minutes).count
=> 19617
```
The 0.2% difference can be attributed to mirrors constantly transitioning between `finished`, `scheduled`, and `started`, which means the total size of the `Project.mirror.joins(:mirror_data).with_import_status('finished')` set changes along with them.
All of this means that on GitLab.com, we would end up updating each mirror somewhere between every $`23`$ and $`23 * 1.2 \approx 28`$ minutes.
For smaller instances, we would need some defaults so that an instance with a single mirror won't try to update it every single second.
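One way to implement those defaults is to clamp the computed interval between instance-wide bounds. The sketch below assumes hypothetical constant names and values; nothing here is an existing GitLab setting:

```ruby
# Sketch: clamp the dynamically computed interval so that a small
# instance with one fast mirror doesn't update it every few seconds.
# MIN/MAX names and values are hypothetical defaults, for illustration.
MIN_UPDATE_INTERVAL = 5 * 60       # never update more often than every 5 minutes
MAX_UPDATE_INTERVAL = 24 * 60 * 60 # never wait longer than a day

def effective_update_interval(concurrent_update_time)
  concurrent_update_time.clamp(MIN_UPDATE_INTERVAL, MAX_UPDATE_INTERVAL)
end

effective_update_interval(2.5)    # a single fast mirror is held to 300s
effective_update_interval(1374.0) # GitLab.com-scale interval passes through
```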
Some more thoughts:
- We can make `MAX_RETRY` automatically be determined based on the interval determined above, as suggested in https://gitlab.com/gitlab-org/gitlab-ee/issues/3885.
- We currently punish slow updates by pushing their `next_execution_timestamp` further out into the future, and we could continue to do this by looking at the number of standard deviations by which their update time exceeds the average.
- We can likely do away with the "maximum capacity" setting, in favor of some multiple of the configured concurrency.
- We can likely do away with the "capacity threshold" setting, by resolving https://gitlab.com/gitlab-org/gitlab-ee/issues/5035.
- We may be able to do away with the "maximum delay" setting, with a reasonable default and some "smarter" punishment for slowness and failure.
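The standard-deviation-based penalty for slow updates mentioned in the list above could look roughly like this. This is only a sketch; the helper name and the standard deviation value are hypothetical:

```ruby
# Sketch of a standard-deviation penalty for slow mirror updates.
# Stretch the next interval by one extra base interval per standard
# deviation by which the mirror's update time exceeds the average.
def penalty_factor(update_time, mean, stddev)
  return 1.0 if stddev.zero? || update_time <= mean

  1.0 + ((update_time - mean) / stddev)
end

base_interval = 1374.0 # seconds, from the calculation above
mean = 2.0             # roughly the average update time observed above, seconds
stddev = 5.0           # hypothetical standard deviation, seconds

# A typical mirror keeps the normal interval:
penalty_factor(2.0, mean, stddev)                       # factor of 1.0
# A mirror two standard deviations above the mean waits three intervals:
next_interval = base_interval * penalty_factor(12.0, mean, stddev)
```

Scaling by deviations rather than a fixed multiplier means the penalty adapts as the overall fleet gets faster or slower, which fits the self-tuning spirit of this proposal.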
/cc @tiagonbotelho