Dynamically determine mirror update interval based on total number of mirrors, average update time, and available concurrency
Replaces https://gitlab.com/gitlab-org/gitlab-ee/issues/3885.
If we allow the user to configure a maximum concurrency for mirror updates (which needs to be less than the actual Sidekiq concurrency), we can determine the ideal update interval based on total number of mirrors and their average update time, like this:
```ruby
concurrency = 5 * 7 # GitLab.com has 5 sidekiq-pullmirror nodes, each with 7 threads reserved for mirror updates
recent_mirrors = Project.mirror.joins(:mirror_data).with_import_status('finished').where("projects.mirror_last_update_at > ?", Time.now - 24.hours)
mirrors_count = recent_mirrors.count
mirror_update_time = recent_mirrors.average('EXTRACT(epoch FROM projects.mirror_last_update_at - project_mirror_data.last_update_started_at)').to_f
total_update_time = mirrors_count * mirror_update_time
concurrent_update_time = total_update_time / concurrency
```

`concurrent_update_time` now holds the time it takes to update every mirror once.
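Stripped of the ActiveRecord queries, the calculation reduces to a one-liner; here it is as a standalone sketch with illustrative placeholder numbers (not real GitLab.com data):

```ruby
# Ideal update interval: total work divided by available concurrency.
def concurrent_update_time(mirrors_count, avg_update_time, concurrency)
  (mirrors_count * avg_update_time) / concurrency
end

# 20_000 mirrors averaging 2.5 seconds each, across 35 threads:
interval = concurrent_update_time(20_000, 2.5, 35)
puts interval        # ~1428.57 seconds
puts interval / 60.0 # ~23.8 minutes
```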
After updating a mirror, we can calculate the next time it can safely update like so:
```ruby
self.next_execution_timestamp = Time.now + concurrent_update_time.seconds * rand(1.0..1.2)
```

The addition of `* rand(1.0..1.2)` ensures that updates are spread out a little bit, and that there is some room for a sudden jump in the number of mirrors.
On GitLab.com, these calculations work out to an ideal interval of just under 23 minutes:
```ruby
> concurrency = 5 * 7 # GitLab.com has 5 sidekiq-pullmirror nodes, each with 7 threads reserved for mirror updates
=> 35
> recent_mirrors = Project.mirror.joins(:mirror_data).with_import_status('finished').where("projects.mirror_last_update_at > ?", Time.now - 24.hours); nil
=> nil
> mirrors_count = recent_mirrors.count
=> 19654
> mirror_update_time = recent_mirrors.average('EXTRACT(epoch FROM projects.mirror_last_update_at - project_mirror_data.last_update_started_at)').to_f
=> 2.44642611533372
> total_update_time = mirrors_count * mirror_update_time
=> 48082.05887076894
> concurrent_update_time = total_update_time / concurrency
=> 1373.7731105933983
> concurrent_update_time_minutes = concurrent_update_time / 60
=> 22.896218509889973
```
Looking at the actual mirror update data on GitLab.com, we see the same number: 99.8% of the projects that were last updated in the last 24 hours were updated in the last 23 minutes:

```ruby
> Project.mirror.joins(:mirror_data).with_import_status('finished').where("projects.mirror_last_update_at > ?", Time.now - 23.minutes).count
=> 19617
```
The 0.2% difference can be attributed to the fact that mirrors are constantly transitioning between `finished`, `scheduled` and `started`, which means the total size of the `Project.mirror.joins(:mirror_data).with_import_status('finished')` set changes along with them.
All of this means that on GitLab.com, we would end up updating each mirror somewhere between every 23 and 23 * 1.2 ≈ 28 minutes.
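These bounds follow directly from the measured interval; a quick check of the arithmetic (values from the console session above, rounded to the quoted ~23 and ~28 in the text):

```ruby
concurrent_update_time = 1373.77 # seconds, as computed on GitLab.com
lower = concurrent_update_time / 60       # minimum interval, in minutes
upper = concurrent_update_time * 1.2 / 60 # with maximum jitter applied
puts format('%.1f..%.1f minutes', lower, upper) # 22.9..27.5 minutes
```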
For smaller instances, we would need some defaults so that an instance with a single mirror won't try to update it every single second.
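One way to provide such defaults is to clamp the computed interval between a floor and a ceiling; the `MIN_INTERVAL`/`MAX_INTERVAL` constants below are hypothetical values for illustration, not existing settings:

```ruby
# Hypothetical bounds -- not existing GitLab settings.
MIN_INTERVAL = 5 * 60       # never update a mirror more often than every 5 minutes
MAX_INTERVAL = 24 * 60 * 60 # never wait longer than a day

def clamped_interval(concurrent_update_time)
  concurrent_update_time.clamp(MIN_INTERVAL, MAX_INTERVAL)
end

puts clamped_interval(0.5)    # single-mirror instance => floor of 300 seconds
puts clamped_interval(1374.0) # GitLab.com-sized load => unchanged
```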
Some more thoughts:
- We can make `MAX_RETRY` automatically be determined based on the interval determined above, as suggested in https://gitlab.com/gitlab-org/gitlab-ee/issues/3885.
- We currently punish slow updates by pushing their `next_execution_timestamp` further out into the future, and we could continue to do this by looking at the number of standard deviations by which their update time exceeds the average.
- We can likely do away with the "maximum capacity" setting, in favor of some multiple of the configured concurrency.
- We can likely do away with the "capacity threshold" setting, by resolving https://gitlab.com/gitlab-org/gitlab-ee/issues/5035.
- We may be able to do away with the "maximum delay" setting, with a reasonable default and some "smarter" punishment for slowness and failure.
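The standard-deviation punishment mentioned above could look something like this sketch; the scaling is an assumption, and `avg`/`stddev` stand in for values that would be computed from the same recent-mirrors query:

```ruby
# Sketch: penalize unusually slow mirrors by stretching their interval
# in proportion to how many standard deviations above the mean they are.
def penalty_multiplier(update_time, avg, stddev)
  return 1.0 if stddev.zero? || update_time <= avg
  1.0 + (update_time - avg) / stddev
end

puts penalty_multiplier(2.5, 2.5, 1.0) # average speed => 1.0, no penalty
puts penalty_multiplier(4.5, 2.5, 1.0) # two stddevs slow => 3.0x the interval
```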
/cc @tiagonbotelho