Although this saturation alert has only just been merged and has yet to fire, it's fairly clear from the metric and its growth over the past few months that this fleet needs to be scaled up to cope with demand.
This graph shows the saturation on these machines over the past day, as a percentage, with 100% meaning completely saturated.
There are two approaches to scaling this fleet up to reduce saturation here:
1. Add more runner managers
2. Allow the existing runner managers to handle more runners
At present there is a great deal of pain around deployments for runners, related to the 3h graceful shutdown required for existing jobs to terminate. We discussed a stopgap using Terraform until we can move to k8s, but that is out of scope for this issue.
The deployment pain means that option 1 is not viable, but option 2 should be.
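For background, the graceful shutdown mentioned above is GitLab Runner's documented signal handling: on SIGQUIT the process stops accepting new jobs and waits for the in-flight ones to finish, which with a 3h job timeout can hold a deployment open for hours. A minimal sketch, assuming a single gitlab-runner process per host:

```shell
# Ask the runner for a graceful shutdown: stop picking up new jobs and
# wait for running jobs to complete (up to the 3h timeout noted above).
kill -SIGQUIT "$(pgrep -o -x gitlab-runner)"

# Only proceed with the upgrade/replacement once the process has exited.
while pgrep -x gitlab-runner > /dev/null; do sleep 10; done
```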
Unfortunately, the runner manager hosts appear to be running at their upper limit at present, so we need to take the following actions to move forward:
1. Rotate the shared runner fleet over to more powerful machines to give us some more vertical headroom for scaling up
2. Increase the maximum number of runners per runner-manager (see the config sketch below)
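For reference, the knob behind action 2 lives in the runner manager's config.toml: the global `concurrent` value caps jobs across the whole manager, and each `[[runners]]` entry has its own `limit`. A minimal sketch with illustrative values (not the production configuration):

```toml
# config.toml on a runner manager (illustrative values only)
concurrent = 200          # total jobs this manager will run at once, across all entries

[[runners]]
  name  = "example-worker"   # hypothetical entry name
  limit = 150                # jobs this single entry may run at once
```

Raising a `limit` only helps if `concurrent` leaves room for it, which is why both values move together in the steps further down.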
Thanks for these details @andrewn. Scaling up these machines seems to make a lot of sense. Are there any downsides or risks to this change (outside of increased cost)?
@darbyfrey imho, the biggest risk is the reduced number of available runner managers while we upgrade each instance in turn. We should consider doing this during low-traffic times (possibly the weekend).
As I mentioned yesterday on Slack, I'm going to handle a deployment of the freshly released 12.4.1 version. This already requires the Runner process to be terminated with a graceful shutdown and happens regularly, at least twice a month (for the RC1 and stable deployments). I can use this opportunity to bump each of the managers up, since the deployment is already done one by one on each of the groups (prmX, gsrmX, srmX).
@ahanselka I will have an update to the cookbook that will require review, a merge, and a version bump. I also need to update the runner_update.sh script before I start the deployment: resizing a VM requires it to be turned off, which creates a break in the procedure defined by the upgrade script, and I definitely don't want to run all of the commands from runner_update.sh by hand.
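For reference, and assuming these managers are GCE instances, the resize step that runner_update.sh can't cover looks roughly like this (instance name, zone, and machine type below are placeholders):

```shell
# The machine type can only be changed while the instance is stopped,
# which is what breaks the normal upgrade procedure.
gcloud compute instances stop runners-manager-example --zone us-east1-c

# Example target: a custom 10 vCPU / 16 GB machine type
gcloud compute instances set-machine-type runners-manager-example \
  --zone us-east1-c --machine-type custom-10-16384

gcloud compute instances start runners-manager-example --zone us-east1-c
```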
I've updated the prmX managers to 12.4.1 and resized the machines from 2 vCPU/4 GB to 10 vCPU/16 GB. Let's wait until tomorrow evening and see how the CPU/memory usage looks during the high-load period. We can track the usage with:
limit = 200 (for the gitlab-com, charts, and dev.gitlab.org entries).
Let's try to double the capacity of the gitlab-org entry. Changing its limit to 600 requires us to bump concurrent to 800 (to leave some room for the other workers). Let's do it in three steps (the resulting config is sketched after the list):
1. Increase to 125% of current value (limit = 375, concurrent = 500)
2. Increase to 150% of current value (limit = 450, concurrent = 600)
3. Increase to 200% of current value (limit = 600, concurrent = 800)
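Putting the final step's numbers together, the prmX config.toml would end up along these lines (entry names are illustrative; the current gitlab-org limit of 300 is inferred from the 125% step):

```toml
concurrent = 800          # bumped so the gitlab-org entry can actually reach its new limit

[[runners]]
  name  = "gitlab-org"    # illustrative entry name
  limit = 600             # 200% of the current 300

[[runners]]
  name  = "gitlab-com"    # the other entries keep limit = 200
  limit = 200
```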
All looks really nice. I'm going to deploy step 3 for prmX and start increasing the limits for srmX and gsrmX. However, I'll skip the 125% step and just start from the 150% increase.
For srmX we would do:
1. Increase to 150% of current value (limit = 525, concurrent = 525)
2. Increase to 200% of current value (limit = 700, concurrent = 700) - this definitely needs a double-check of resource utilization before we make this step
For gsrmX I think for now we only need to increase the GitLab.com worker, so:
1. Increase to 150% of current value (limit = 600, concurrent = 750)
2. Increase to 200% of current value (limit = 800, concurrent = 1000) - this definitely needs a double-check of resource utilization before we make this step
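Working backwards from these targets, the current (100%) values they imply are:

```math
\begin{aligned}
\text{srmX limit/concurrent:}\quad & 525 / 1.5 = 700 / 2 = 350 \\
\text{gsrmX GitLab.com worker limit:}\quad & 600 / 1.5 = 800 / 2 = 400 \\
\text{gsrmX concurrent:}\quad & 750 / 1.5 = 1000 / 2 = 500
\end{aligned}
```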
The srmX and gsrmX limits have been increased to 150% of their initial values as of ~14:36 UTC. Runners should soon reload the configuration files and start handling more jobs :)
@andrewn @erushton I'm out until the 17th, and I will not do step 2 for srmX and gsrmX. But looking at the resource usage after bumping these two manager types to 150%, we should be fine to set the final values.
@tmaczukin said that he will upgrade the Docker version himself, so after that is done and we've verified that the quota went down, we can go ahead and deploy your changes.
@dawsmith certainly. It sounds like this issue is about increasing the system resources on the runner managers. @tmaczukin said he would be able to do this at the same time as the runner upgrade, and I have asked how I can support him in this.