Although this saturation alert has only just been merged and has yet to fire, it's fairly clear from the metric and its growth over the past few months that this fleet needs to be scaled up to cope with demand.
This graph shows the saturation on these machines over the past day, as a percentage, with 100% meaning completely saturated.
There are two approaches to scaling this fleet up to reduce saturation here:
1. Add more runner managers
2. Allow the existing runner managers to handle more runners
At present there is a great deal of pain around deployments for runners, related to the 3h graceful shutdown required for existing jobs to terminate. We discussed a stopgap using Terraform until we can move to k8s, but that is out of scope for this issue.
The deployment pain means that option 1 is not viable, but option 2 should be.
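For background, the graceful shutdown mentioned above is GitLab Runner's documented signal handling: on SIGQUIT the process stops accepting new jobs and waits for the in-flight ones to finish, which with a 3h job timeout can hold a deployment open for hours. A minimal sketch, assuming a single gitlab-runner process per host:

```shell
# Ask the runner for a graceful shutdown: stop picking up new jobs and
# wait for running jobs to complete (up to the 3h timeout noted above).
kill -SIGQUIT "$(pgrep -o -x gitlab-runner)"

# Only proceed with the upgrade/replacement once the process has exited.
while pgrep -x gitlab-runner > /dev/null; do sleep 10; done
```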
Unfortunately, the runner manager hosts appear to be running at their upper limit at present, so we need to take the following actions to move forward:
1. Rotate the shared runner fleet over to more powerful machines to give us some more vertical headroom for scaling up
2. Increase the maximum number of runners per runner-manager (see the config sketch below)
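For reference, the knob behind action 2 lives in the runner manager's config.toml: the global `concurrent` value caps jobs across the whole manager, and each `[[runners]]` entry has its own `limit`. A minimal sketch with illustrative values (not the production configuration):

```toml
# config.toml on a runner manager (illustrative values only)
concurrent = 200          # total jobs this manager will run at once, across all entries

[[runners]]
  name  = "example-worker"   # hypothetical entry name
  limit = 150                # jobs this single entry may run at once
```

Raising a `limit` only helps if `concurrent` leaves room for it, which is why both values move together in the steps further down.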
Thanks for these details @andrewn. Scaling up these machines seems to make a lot of sense. Are there any downsides or risks to this change (outside of increased cost)?
@darbyfrey imho, the biggest risk is the reduced number of available runner managers while we upgrade each instance in turn. We should consider doing this during low-traffic times (possibly the weekend).
As I mentioned yesterday on Slack, I'm going to handle a deployment of the freshly released 12.4.1 version. This already requires the Runner process to be terminated with a graceful shutdown and happens regularly, at least twice a month (for the RC1 and stable deployments). I can use this opportunity to bump each of the managers up, since the deployment is already done one by one on each of the groups (prmX, gsrmX, srmX).
@ahanselka I will have an update to the cookbook that will require review, a merge, and a version bump. I also need to update the runner_update.sh script before I start the deployment: resizing a VM requires it to be turned off, which creates a break in the procedure defined by the upgrade script, and I definitely don't want to run all of the commands from runner_update.sh by hand.
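For reference, and assuming these managers are GCE instances, the resize step that runner_update.sh can't cover looks roughly like this (instance name, zone, and machine type below are placeholders):

```shell
# The machine type can only be changed while the instance is stopped,
# which is what breaks the normal upgrade procedure.
gcloud compute instances stop runners-manager-example --zone us-east1-c

# Example target: a custom 10 vCPU / 16 GB machine type
gcloud compute instances set-machine-type runners-manager-example \
  --zone us-east1-c --machine-type custom-10-16384

gcloud compute instances start runners-manager-example --zone us-east1-c
```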
I've updated the prmX managers to 12.4.1 and resized the machines from 2 vCPU/4 GB to 10 vCPU/16 GB. Let's wait until tomorrow evening and see how the CPU/memory usage looks during the high-load period. We can track the usage with:
limit = 200 (for the gitlab-com, charts, and dev.gitlab.org entries).
Let's try to double the capacity of the gitlab-org entry. Changing its limit to 600 requires us to bump concurrent to 800 (to leave some room for the other workers). Let's do it in three steps (the resulting config is sketched after the list):
1. Increase to 125% of current value (limit = 375, concurrent = 500)
2. Increase to 150% of current value (limit = 450, concurrent = 600)
3. Increase to 200% of current value (limit = 600, concurrent = 800)
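Putting the final step's numbers together, the prmX config.toml would end up along these lines (entry names are illustrative; the current gitlab-org limit of 300 is inferred from the 125% step):

```toml
concurrent = 800          # bumped so the gitlab-org entry can actually reach its new limit

[[runners]]
  name  = "gitlab-org"    # illustrative entry name
  limit = 600             # 200% of the current 300

[[runners]]
  name  = "gitlab-com"    # the other entries keep limit = 200
  limit = 200
```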
All looks really nice. I'm going to deploy step 3 for prmX and start increasing the limits for srmX and gsrmX. However, I'll skip the 125% step and just start from the 150% increase.
For srmX we would do:
1. Increase to 150% of current value (limit = 525, concurrent = 525)
2. Increase to 200% of current value (limit = 700, concurrent = 700) - this definitely needs a double-check of resource utilization before we make this step
For gsrmX I think for now we only need to increase the GitLab.com worker, so:
1. Increase to 150% of current value (limit = 600, concurrent = 750)
2. Increase to 200% of current value (limit = 800, concurrent = 1000) - this definitely needs a double-check of resource utilization before we make this step
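Working backwards from these targets, the current (100%) values they imply are:

```math
\begin{aligned}
\text{srmX limit/concurrent:}\quad & 525 / 1.5 = 700 / 2 = 350 \\
\text{gsrmX GitLab.com worker limit:}\quad & 600 / 1.5 = 800 / 2 = 400 \\
\text{gsrmX concurrent:}\quad & 750 / 1.5 = 1000 / 2 = 500
\end{aligned}
```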
The srmX and gsrmX limits have been increased to 150% of their initial values as of ~14:36 UTC. Runners should soon reload the configuration files and start handling more jobs :)
@andrewn @erushton I'm out until the 17th, and I will not do step 2 for srmX and gsrmX. But looking at the resource usage after bumping these two manager types to 150%, we should be fine to set the final values.
@tmaczukin said that he will upgrade the Docker version himself, so after that is done and we've verified that the quota went down, we can go ahead and deploy your changes.
@dawsmith certainly. It sounds like this issue is about increasing the system resources on the runner managers. @tmaczukin said he would be able to do this at the same time as the runner upgrade, and I have asked how I can support him in this.