Time to execute the change (including a possible rollback)
Detailed steps for the change. Each step must include:
- pre-conditions for execution of the step
- execution commands for the step
- post-execution validation for the step
- rollback of the step
Details on changes coming - we need to coordinate with @erushton to pick the values and times we want to test these changes.
Test 1: Adjust warm pool for shared runner managers
- https://ops.gitlab.net/gitlab-cookbooks/chef-repo/merge_requests/1784
- To revert, revert the MR and run `chef-client` on the SRM machines (see the config sketch below).
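For reference, a minimal sketch of where this value lives: the warm pool size is the `IdleCount` setting in the `[runners.machine]` section of each runner manager's `config.toml` (the values below are illustrative; the real ones are managed through the chef MR above):

```toml
# Illustrative excerpt from a shared runner manager's config.toml.
[[runners]]
  executor = "docker+machine"
  [runners.machine]
    IdleCount = 50   # number of pre-provisioned idle VMs kept warm
    IdleTime = 1800  # seconds a VM may stay idle before being removed
```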
Test 2: Adjust the cron schedule for `pipeline_schedule_worker` from `19 * * * *`
- https://ops.gitlab.net/gitlab-cookbooks/chef-repo/merge_requests/1806
- To revert, revert the MR, apply it to chef, and run `chef-client` on the SRM machines (see the setting sketch below).
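For context, on an Omnibus install this worker's schedule is controlled by `gitlab_rails['pipeline_schedule_worker_cron']` in `gitlab.rb`; a minimal sketch of the current value (GitLab.com applies the equivalent through the chef MR above rather than by editing the file directly):

```ruby
# /etc/gitlab/gitlab.rb - illustrative sketch.
# Run the pipeline schedule worker once an hour, at minute 19.
gitlab_rails['pipeline_schedule_worker_cron'] = "19 * * * *"
```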
I think we should test one thing at a time: either increase the IdleCount, wait a day or two, and see how this affects the system, or change the cron definition for Pipeline Schedules, wait a day or two, and see how that affects the system. If we do both at once, we will not know which created the biggest change (if any).
Agreed @tmaczukin - I wanted to start this change issue to map out which settings we want to experiment with, and which values and times we will change those settings to, so that we get a good isolated test. We will make sure to only change one thing at a time.
Also, please note that the `pipeline_schedule_worker` cron setting is a GitLab setting, not a Runner setting. So to test both, we need to update the configuration of both GitLab.com and the Shared Runners for GitLab.com.
As for values: from how I understand the autoscaling algorithm, the relationship between the idle count and its impact on the queue should be linear. So let's start by doubling it - even if that doesn't halve the queue time in these cases, it will have a noticeable impact, and we can then refine it further.
@tmaczukin what do you think? This would mean increasing `IdleCount` from 50 to 100 on each shared runner manager.
I'm confused - why would the p99 be pinned at 15 minutes for whole hours at a time? If it were purely a cron problem, wouldn't we see the spike at a particular time and then recover as we scale up to accommodate?
This would seem to indicate that the problem is how fast we are emptying the queue versus how fast it refills. If the queue length is 50, and by the time we spin up 50 jobs another 50 have entered the queue, we wouldn't really reduce the length, right? Are we hitting the concurrency caps on our SRMs?
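To make the drain-versus-refill concern concrete, here is a toy model (every number is invented for illustration) showing that when jobs arrive as fast as the warm pool can pick them up, the backlog never shrinks:

```ruby
# Toy queue model - all rates here are invented for illustration.
queue = 50         # jobs currently waiting
arrival_rate = 50  # jobs entering the queue per interval
idle_count = 50    # warm machines able to pick up jobs per interval

5.times do |tick|
  queue += arrival_rate              # new jobs land in the queue
  started = [queue, idle_count].min  # idle machines each take one job
  queue -= started
  puts "tick #{tick}: started=#{started}, queue=#{queue}"
end
# With idle_count == arrival_rate the backlog stays pinned at 50;
# doubling idle_count to 100 drains it to zero after the first tick.
```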
@joshlambert - I re-checked my links - my 6 hour view for this week was incorrect and still pointed at Sept 4 rather than Sept 11.
If you look again, you can see the cron peaks at about the 20th minute of every hour. We don't have a span where we were stuck at 15 min, just spikes every hour.
I would still expect that making the cron more frequent would reduce the height of those hourly peaks, though possibly at the cost of more frequent smaller peaks. It's an experiment we can perform and revert as needed.
I spoke with @dawsmith about it and we think that we should try `*/20`, so that we run it 3 times an hour (see the comparison below). I see this as low risk since we are already putting a bunch of jobs in at once.
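For clarity, the two minute-field patterns compare like this:

```
19 * * * *    # current: once per hour, at minute 19
*/20 * * * *  # proposed: three times per hour, at :00, :20, :40
```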
@joshlambert - circling back on the discussion of spend - there is actually more complication here. We have purchased committed use discounts for CPU for gitlab-ci, so I'm not sure we'll see an increase in CPU spend from this increase in the warm pool. I'll work with @davis_townsend - our new infra data analyst - to track this.
@dawsmith - Is there anything you want to look at in this regard yet?
Does this all show up in the gitlab-ci project usage, or is there more granular logging available somewhere that we could look at as well?
@davis_townsend - yes, this should show up in gitlab-ci project usage. The warm pool having a higher number of VMs waiting for jobs should increase our CPU seconds used per month for the project, but not our total spend, since we should be below the committed use discount purchase. That is the theory we should validate.
The cron change seems to have made a significant difference:
last 2 days (note that 2019-09-15 is the end of the weekend, and the different autoscaling configuration that we use for this off-peak time now shows a significant change in schedule handling)
last 24 hours
last 12 hours
last 3 hours
For an additional comparison:
2019-09-11
2019-09-13
2019-09-16 (we can see when @ahanselka applied the cron change at ~21 UTC)
2019-09-17
There is a visible change. However, we still have a lot of jobs started at specific hours (which in this case is user-defined in the Pipeline Schedule form), which causes a lot of jobs to start at once and degrades queue handling performance.
I'd say let's wait a day or two to see how the timing looks over a longer term. Currently, even with those spikes at some hours, it looks promising :)
just a note that we should update the documentation on GL.com settings when this work is completed if we end up with a different cron setting: https://docs.gitlab.com/ee/user/gitlab_com/
Are we at a point where we think we can just make this the default for GitLab.com?
Should we make this the default for self-managed as well? We could provide notice, but it seems like this is a good setting change to make more generally.
Do we know what the existing spikes are due to? Is it the daily jobs that are all being run at once?
The current options seem like a recipe for a large influx of jobs at the same time.
@dawsmith I'm up for trying `*/10`, it can't hurt. The spike noted by @tmaczukin may be the daily cron default? The processed jobs seem relatively steady, but the spike seems to signify that a large chunk of jobs enters the queue all at once?
Makes sense - @ahanselka, are you good to change to `*/10` and let us know when that is in? Then we can watch for another few days.
That idea makes sense @joshlambert about the 4AM default - I wonder if there is any way to splay it? Like generate a random minute 0-59 and then use `<minute> 4 * * *` for the cron pattern?
We certainly could, although we'd need to alter the implementation. Right now the radio options just fill the cron text with specific values. We'd probably want to hide the cron entry in those cases and randomize within the hour (a sketch below).
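A minimal sketch of that splay idea, assuming the cron string is generated once when the radio option is selected and then persisted (the helper name and context are hypothetical):

```ruby
# Hypothetical helper: pick a random minute once and persist the
# resulting cron string, so each schedule keeps a stable but
# splayed daily start time instead of everyone firing at 4:00.
def splayed_daily_cron(hour = 4)
  minute = rand(60)          # random minute in 0..59
  "#{minute} #{hour} * * *"  # e.g. "37 4 * * *"
end

puts splayed_daily_cron  # => e.g. "12 4 * * *"
```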
Based on the graph pictured below, there were no significant changes with our most recent change to `*/5` for the pipeline cron. As such, I think we should leave it where it is, get the doc change merged, and close this issue.