Dynamically determine mirror update interval based on total number of mirrors, average update time, and available concurrency
Replaces https://gitlab.com/gitlab-org/gitlab-ee/issues/3885.
If we allow the user to configure a maximum concurrency for mirror updates (which needs to be less than the actual Sidekiq concurrency), we can determine the ideal update interval based on total number of mirrors and their average update time, like this:
```ruby
concurrency = 5 * 7 # GitLab.com has 5 sidekiq-pullmirror nodes, each with 7 threads reserved for mirror updates
recent_mirrors = Project.mirror.joins(:mirror_data).with_import_status('finished').where("projects.mirror_last_update_at > ?", Time.now - 24.hours)
mirrors_count = recent_mirrors.count
mirror_update_time = recent_mirrors.average('EXTRACT(epoch FROM projects.mirror_last_update_at - project_mirror_data.last_update_started_at)').to_f
total_update_time = mirrors_count * mirror_update_time
concurrent_update_time = total_update_time / concurrency
```

`concurrent_update_time` now holds the time it takes to update every mirror once.
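Stripped of the ActiveRecord queries, the calculation reduces to a one-liner; here it is as a standalone sketch with illustrative placeholder numbers (not real GitLab.com data):

```ruby
# Ideal update interval: total work divided by available concurrency.
def concurrent_update_time(mirrors_count, avg_update_time, concurrency)
  (mirrors_count * avg_update_time) / concurrency
end

# 20_000 mirrors averaging 2.5 seconds each, across 35 threads:
interval = concurrent_update_time(20_000, 2.5, 35)
puts interval        # ~1428.57 seconds
puts interval / 60.0 # ~23.8 minutes
```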
After updating a mirror, we can calculate the next time it can safely update like so:
```ruby
self.next_execution_timestamp = Time.now + concurrent_update_time.seconds * rand(1.0..1.2)
```

The addition of `* rand(1.0..1.2)` ensures that updates are spread out a little bit, and that there is some room for a sudden jump in the number of mirrors.
On GitLab.com, these calculations work out to an ideal interval of just under 23 minutes:
```ruby
> concurrency = 5 * 7 # GitLab.com has 5 sidekiq-pullmirror nodes, each with 7 threads reserved for mirror updates
=> 35
> recent_mirrors = Project.mirror.joins(:mirror_data).with_import_status('finished').where("projects.mirror_last_update_at > ?", Time.now - 24.hours); nil
=> nil
> mirrors_count = recent_mirrors.count
=> 19654
> mirror_update_time = recent_mirrors.average('EXTRACT(epoch FROM projects.mirror_last_update_at - project_mirror_data.last_update_started_at)').to_f
=> 2.44642611533372
> total_update_time = mirrors_count * mirror_update_time
=> 48082.05887076894
> concurrent_update_time = total_update_time / concurrency
=> 1373.7731105933983
> concurrent_update_time_minutes = concurrent_update_time / 60
=> 22.896218509889973
```
Looking at the actual mirror update data on GitLab.com, we see the same number: 99.8% of the projects that were last updated in the last 24 hours were updated in the last 23 minutes:

```ruby
> Project.mirror.joins(:mirror_data).with_import_status('finished').where("projects.mirror_last_update_at > ?", Time.now - 23.minutes).count
=> 19617
```
The 0.2% difference can be attributed to the fact that mirrors are constantly transitioning between `finished`, `scheduled` and `started`, which means the total size of the `Project.mirror.joins(:mirror_data).with_import_status('finished')` set changes along with them.
All of this means that on GitLab.com, we would end up updating each mirror somewhere between every 23 and 23 * 1.2 ≈ 28 minutes.
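These bounds follow directly from the measured interval; a quick check of the arithmetic (values from the console session above, rounded to the quoted ~23 and ~28 in the text):

```ruby
concurrent_update_time = 1373.77 # seconds, as computed on GitLab.com
lower = concurrent_update_time / 60       # minimum interval, in minutes
upper = concurrent_update_time * 1.2 / 60 # with maximum jitter applied
puts format('%.1f..%.1f minutes', lower, upper) # 22.9..27.5 minutes
```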
For smaller instances, we would need some defaults so that an instance with a single mirror won't try to update it every single second.
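One way to provide such defaults is to clamp the computed interval between a floor and a ceiling; the `MIN_INTERVAL`/`MAX_INTERVAL` constants below are hypothetical values for illustration, not existing settings:

```ruby
# Hypothetical bounds -- not existing GitLab settings.
MIN_INTERVAL = 5 * 60       # never update a mirror more often than every 5 minutes
MAX_INTERVAL = 24 * 60 * 60 # never wait longer than a day

def clamped_interval(concurrent_update_time)
  concurrent_update_time.clamp(MIN_INTERVAL, MAX_INTERVAL)
end

puts clamped_interval(0.5)    # single-mirror instance => floor of 300 seconds
puts clamped_interval(1374.0) # GitLab.com-sized load => unchanged
```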
Some more thoughts:
- We can make `MAX_RETRY` automatically be determined based on the interval determined above, as suggested in https://gitlab.com/gitlab-org/gitlab-ee/issues/3885.
- We currently punish slow updates by pushing their `next_execution_timestamp` further out into the future, and we could continue to do this by looking at the number of standard deviations by which their update time exceeds the average.
- We can likely do away with the "maximum capacity" setting, in favor of some multiple of the configured concurrency.
- We can likely do away with the "capacity threshold" setting, by resolving https://gitlab.com/gitlab-org/gitlab-ee/issues/5035.
- We may be able to do away with the "maximum delay" setting, with a reasonable default and some "smarter" punishment for slowness and failure.
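The standard-deviation punishment mentioned above could look something like this sketch; the scaling is an assumption, and `avg`/`stddev` stand in for values that would be computed from the same recent-mirrors query:

```ruby
# Sketch: penalize unusually slow mirrors by stretching their interval
# in proportion to how many standard deviations above the mean they are.
def penalty_multiplier(update_time, avg, stddev)
  return 1.0 if stddev.zero? || update_time <= avg
  1.0 + (update_time - avg) / stddev
end

puts penalty_multiplier(2.5, 2.5, 1.0) # average speed => 1.0, no penalty
puts penalty_multiplier(4.5, 2.5, 1.0) # two stddevs slow => 3.0x the interval
```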
/cc @tiagonbotelho