Prepare CI infrastructure for bigger scale
There is a need to prepare our CI infrastructure for a bigger scale:
- We execute more and more jobs from GitLab CE/EE and all their forks - this already utilizes almost all resources of the `gitlab-shared-runners-manager-X` Runners.
- We plan to enable Auto DevOps by default for all projects on GitLab.com soon. This may significantly increase the number of jobs executed on GitLab.com.
This issue is created to investigate the current state of our CI infrastructure and to plan what we should change to prepare for the observed and planned increase of the load.
At this moment - after a quick investigation - I can say that:
- [DONE] We should calculate if and by how much we should request an increase of our GCP quota limits:
  - prepare estimates (see: https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/4907#note_101169499)
  - request quota increase
  - confirm that quotas were increased
- [DONE] Most of our Runner Manager machines are 1 CPU VMs. We should update the machines to be at least 2 CPU VMs (basing on current CPU and load metrics). Let's reconfigure the CPU/RAM settings of our VMs (details at https://gitlab.com/gitlab-com/infrastructure/issues/4907#note_99819368):
  - shared-runners-manager-3.gitlab.com
  - shared-runners-manager-4.gitlab.com
  - shared-runners-manager-5.gitlab.com
  - shared-runners-manager-6.gitlab.com
  - private-runners-manager-3.gitlab.com
  - private-runners-manager-4.gitlab.com
  - gitlab-shared-runners-manager-3.gitlab.com
  - gitlab-shared-runners-manager-4.gitlab.com
  - gitlab-shared-runners-manager-5.gitlab.com
  - gitlab-shared-runners-manager-6.gitlab.com
  - shared-runners-manager-3.staging.gitlab.com
  - shared-runners-manager-4.staging.gitlab.com
- [DONE] File descriptors - the `gitlab-runner` process on our managers is managed by Systemd. This means that by default it is started with a limit of 1024 max open file descriptors. As we can already see for gitlab-shared-runners-manager-X, this is the biggest blocker after CPU. We should add a possibility to manage this limit. The change in the cookbook was implemented with gitlab-cookbooks/cookbook-wrapper-gitlab-runner!19 (merged). Deployment will be done with https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2487, but we will need the next Runner restart to see the limits applied. I'll try to do a machine-by-machine restart tomorrow, but if I don't manage to do this, then on Monday we should have Runner 11.3.0-rc1, which should be deployed immediately after it is released. This would then restart the processes. A quick way to verify the effective limit on a manager is sketched right after this list.
- [DONE] On gitlab-shared-runners-manager-X and private-runners-manager-X we currently see that, with a bigger load and a bigger volume of cache, our current caching infrastructure (own servers using the `minio/minio` Docker image) is slow and not reliable. With #4565 (closed) we're in the middle of moving to GCS as the caching backend. For now the prmX and gsrmX machines were reconfigured and it seems to work well. We should wait a few days to see how it behaves, and after GitLab Runner 11.3.0-rc1 is released we should also move shared-runners-manager-X to GCS:
  - shared-runners-manager-X
  - gitlab-shared-runners-manager-X
  - private-runners-manager-X
- `concurrent` and `limit` settings define the upper limit of jobs that we can handle with a single manager. After upgrading the quota, the number of CPUs, and the max_open_files limits, we should consider how much we can increase these settings. However, bigger scaling of our capacity should be done by increasing the number of managers, once we've decided which `concurrent` and `limit` values are the best.
- Number of managers - as described in the previous point, after we find the best maximum number of jobs that can be handled by a single Runner manager, we should scale the infrastructure further by increasing the number of managers (this also aligns with the automated part of our plan described at #4813 (moved)).
- API requests - an increase in executed jobs means an increase in the API traffic generated by Runners. We should estimate how big this change will be once the number of handled jobs goes up.
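Regarding the file descriptors item above, here is a minimal sketch (assuming a Linux manager with `pgrep` available and a single `gitlab-runner` process) to verify the limit actually applied to the running process after a restart:

```python
#!/usr/bin/env python3
"""Check the max-open-files limit of the running gitlab-runner process.

A read-only sketch for a Linux host; it assumes a single gitlab-runner
process and only inspects /proc, so it is safe to run on a manager.
"""
import subprocess

# Find the PID of the gitlab-runner process (oldest match if there are several).
pid = subprocess.check_output(["pgrep", "-o", "gitlab-runner"], text=True).strip()

# /proc/<pid>/limits lists the soft and hard limits applied to the process.
with open(f"/proc/{pid}/limits") as limits:
    for line in limits:
        if line.startswith("Max open files"):
            print(f"gitlab-runner (pid {pid}): {line.strip()}")
            break
```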
We should estimate how big the increase of jobs after enabling Auto DevOps may be, and prepare the capacity for the first 1%. Then, while the gradual rollout proceeds, we should observe the metrics and react by changing the number of managers. A back-of-envelope calculation of the required number of managers is sketched below.
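To illustrate the kind of estimate meant here, a minimal sketch in Python - all input values are placeholders for illustration, not measured numbers; the real ones have to come from the dashboards once the fd and cache changes settle:

```python
import math

# Placeholder inputs - replace with values read from the CI dashboards.
current_peak_running_jobs = 2000   # hypothetical peak of concurrently running jobs
expected_growth_factor = 1.2       # hypothetical growth after the first rollout step
concurrent_per_manager = 500       # hypothetical `concurrent` value of one manager

target_running_jobs = current_peak_running_jobs * expected_growth_factor
managers_needed = math.ceil(target_running_jobs / concurrent_per_manager)

print(f"target concurrent jobs: {target_running_jobs:.0f}")
print(f"managers needed at concurrent={concurrent_per_manager}: {managers_needed}")
```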
To gather more data and make good decisions, I've added a few new panels at https://dashboards.gitlab.net/d/000000159/ci. Look for:
- cache performance,
- number of handled jobs,
- the `Runner manager resources usage` row.
Since the cache change for prm/gsrm was finished a few hours ago and the file descriptors change will most probably be applied tomorrow, we will need to wait until Tuesday/Wednesday next week before making calculations and estimates. The list above already contains some things that should be done, so for now I'll proceed with what we already know needs to be done.
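If we want a quick sanity check of the new GCS cache backend in the meantime, something like the sketch below could work; it assumes the google-cloud-storage Python client, working credentials on the manager, and a hypothetical bucket name - it measures a single small round trip, not real cache traffic:

```python
import time
from google.cloud import storage

BUCKET = "runners-cache-example"  # hypothetical bucket name, not the real one

client = storage.Client()
blob = client.bucket(BUCKET).blob("smoke-test/cache-probe")
payload = b"x" * 1024 * 1024  # 1 MiB probe object

start = time.monotonic()
blob.upload_from_string(payload)
upload_s = time.monotonic() - start

start = time.monotonic()
blob.download_as_bytes()
download_s = time.monotonic() - start

blob.delete()  # clean up the probe object
print(f"upload: {upload_s:.2f}s, download: {download_s:.2f}s")
```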
By the way, at this moment the `Last 24 hours` view shows `276841` started jobs on GitLab.com. In our discussion on Slack I saw an estimate that after enabling Auto DevOps we may expect a 1-5x increase of the load. This means that we could hit our goal of 1M jobs each day much sooner than we expected; a quick projection is sketched below.
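For reference, a rough projection of that estimate (treating 1-5x as a total multiplier on the current daily count; nothing here is measured beyond the number quoted above):

```python
# Project daily job volume after Auto DevOps is enabled, using the current
# "Last 24 hours" count and the 1-5x estimate from the Slack discussion.
current_jobs_per_day = 276_841
daily_goal = 1_000_000

for multiplier in range(1, 6):
    projected = current_jobs_per_day * multiplier
    status = "above" if projected >= daily_goal else "below"
    print(f"{multiplier}x -> {projected:,} jobs/day ({status} the 1M/day goal)")
```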
I also still need to read #2632 (closed) and #3160 (moved) again to catch everything that is not included in the description above. I'll update this issue with anything that I consider important.