Decided to move the runner manager used for gitlab-qa runs in package-and-build pipelines to a folder owned by Quality
A new GCP project named gitlab-qa-runners was created for this purpose
Will create a new VM in this project and get gitlab-runner installed in it. Auto-scaled VMs using docker-machine will also be created in this project.
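For reference, creating the manager VM in that project would look roughly like this (a sketch only; the VM name, zone, machine type and image are placeholders, not necessarily what was actually used):

```shell
# Hypothetical name/zone/machine-type/image; only the project name
# (gitlab-qa-runners) comes from this issue.
gcloud compute instances create qa-runner-manager \
  --project=gitlab-qa-runners \
  --zone=us-east1-c \
  --machine-type=n1-standard-2 \
  --image-family=ubuntu-1804-lts \
  --image-project=ubuntu-os-cloud
```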
QA jobs that require higher computing power (anchor high_capacity in the .gitlab-ci.yml file) are managed by the runner build-trigger-runner-manager-gitlab-org. This was set up when QA jobs from CE, EE and omnibus-gitlab were run in the omnibus-gitlab pipeline. However, that situation has now changed: the jobs now run in the QA pipeline itself and are just triggered from the omnibus-gitlab pipeline after the package and docker jobs.
The ownership/maintenance of this runner manager was initially with the Distribution team. Since it is now used only by QA pipelines, I think it is right to hand over ownership to the Quality team.
PS: For completeness, here is some additional info
QA jobs that don't require higher computing power are handled by shared auto-scaling runners.
Package build and Docker build jobs that are part of the triggered pipeline in omnibus-gitlab are managed by another runner - triggered-builds-runner-manager-gcp (previously build-trigger-runners-manager-gcp)
Well, I am not sure what exactly is actionable here. A section in https://about.gitlab.com/handbook/engineering/quality/ regarding the infrastructure used by the Quality team, maybe? @meks I am assigning this to you, since you can decide/delegate appropriately.
The situation did change, but we still use machines from this runner manager to build packages, so instead of just handing over the keys to the setup we should probably split the use cases. Basically, the current manager boots up machines for triggered builds and also for QA builds. The latter we can remove from the manager and spin off into a new manager maintained by Quality.
@ddavison If you are the one to hand this over to, I am happy to guide you through it. You will need to complete a couple of items first before we can move forward:
I suggest creating a Quality folder in GCP and connecting it to our billing. There you would ideally create a project that would hold the manager machine (a rough gcloud sketch follows below). Example of how I set this up for Distribution in GCP
Access to chef-repo so that you can manage the infrastructure
For both of those, you'll need to request access from the infra team.
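For illustration, the folder/project/billing setup could be done with gcloud along these lines (a sketch; the organization ID, folder ID, project name and billing account ID below are placeholders):

```shell
# Create a Quality folder under the organization (ORG_ID is a placeholder)
gcloud resource-manager folders create \
  --display-name="Quality" \
  --organization=ORG_ID

# Create a project inside that folder to hold the runner manager machine
# (FOLDER_ID and the project name are placeholders)
gcloud projects create quality-runner-manager --folder=FOLDER_ID

# Link the project to the existing billing account
gcloud beta billing projects link quality-runner-manager \
  --billing-account=BILLING_ACCOUNT_ID
```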
I'm unsure if we should try to target this for %12.4 or %12.5. I'd like to get the multi-project pipeline for our triggers to simplify the current state and dogfood the feature, but I'm unsure what we'd need to prioritize it over.
@kwiebers That's probably just it (@balasankarc please correct me if I'm wrong), but I guess that involves a few access requests back and forth etc. I think %12.5 is fine here.
@markglenfletcher I was wondering if we should also move the gitlab-insights-runner dedicated to https://gitlab.com/gitlab-org/gitlab-insights to this centralized place? Moreover, it seems this runner isn't documented anywhere at the moment.
@kwiebers @rymai If I am given temporary access to the QA folder in GCP, I am happy to create a new runner manager machine there and do what is necessary to move the runner over.
@rymai @kwiebers Do we want the runner manager VM (where we install gitlab-runner and docker-machine) and the auto-scaled-by-docker-machine VMs (where the CI jobs actually run) to be created inside the existing gitlab-qa-resources project? Or do you think a new project gitlab-qa-runners under the GitLab QA Projects folder is better, so we can isolate everything QA-runner related in this project?
Update: A new project gitlab-qa-runners was created, and I was given access to it. Next step: Spin up a machine that is managed by chef-repo, get runner and docker-machine installed.
I decided to unblock myself and get the machine created manually and later Chef-ify it.
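Roughly what the manual setup on the new VM involves, assuming a Debian/Ubuntu image (the docker-machine version below is only an example):

```shell
# Install gitlab-runner from the official package repository
curl -L "https://packages.gitlab.com/install/repositories/runner/gitlab-runner/script.deb.sh" | sudo bash
sudo apt-get install -y gitlab-runner

# Install docker-machine, which the docker+machine executor uses to create
# the auto-scaled build VMs (version pinned here only as an example)
curl -L "https://github.com/docker/machine/releases/download/v0.16.2/docker-machine-$(uname -s)-$(uname -m)" \
  -o /tmp/docker-machine
sudo install /tmp/docker-machine /usr/local/bin/docker-machine
```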
@kwiebers I would need the Runner registration token from the CI/CD settings of https://gitlab.com/gitlab-org/gitlab-qa-mirror. Could you either send it to me via 1Password, or make me a Maintainer of that project with an expiration date of 7 days so I can do whatever is necessary (the project is a non-release mirror, so I don't think an access request will be required).
@balasankarc I've set you as a maintainer of gitlab-org/gitlab-qa-mirror, with the access expiring on 2021-10-30. Let me know if you ever need to extend it. Thanks for taking care of this!
@mlapierre reminded me that the existing runner build-trigger-runner-manager-gitlab-org is also used in gitlab-qa, which means we need to register a new runner to replace that. I will do that once I get the GCP project and current VM managed by Infra's terraform setup.
Registered a runner named qa-runner for gitlab-qa project.
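For reference, the registration looks roughly like this (the description, executor and default image shown are illustrative; TOKEN is the registration token from the project's CI/CD settings):

```shell
sudo gitlab-runner register \
  --non-interactive \
  --url "https://gitlab.com/" \
  --registration-token "$TOKEN" \
  --description "qa-runner" \
  --executor "docker+machine" \
  --docker-image "docker:stable"
```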
Also, created an account for @rymai on the VM to avoid me being a single point of failure. Once all of this moves to official infra, we can set up authentication and access properly and get myself removed.
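For completeness, adding such an account on the VM amounts to roughly the following (the username and public key path are placeholders):

```shell
# Create the user, then install their SSH public key
sudo useradd --create-home --shell /bin/bash rymai
sudo mkdir -p /home/rymai/.ssh
sudo cp /tmp/rymai.pub /home/rymai/.ssh/authorized_keys
sudo chown -R rymai:rymai /home/rymai/.ssh
sudo chmod 700 /home/rymai/.ssh
sudo chmod 600 /home/rymai/.ssh/authorized_keys
```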
I went ahead and created an epic &6989 (closed) to house all these issues (this one, #651 (closed), anything else that might pop up). This is a child epic of &5933, and is specifically targeted at the infrastructure maintenance of these runners.
@balasankarc Is it possible to bump the instance type used for the qa-runner?
I have been investigating some of the test flakiness and, after adding some monitoring of resource consumption, found that test runs keep hitting the CPU ceiling fairly constantly. I ran some comparisons on runners with a larger vCPU count and found that the environment can require quite a bit more CPU during peak loads.
current run constantly hitting the 2 vCPU ceiling:
unrestricted run can utilize 4 cores and median hovers around 3:
Some of the configurations that spawn more containers can also use a bit more memory. Runs with praefect can sometimes fail with out-of-memory errors.
Here is an example where it causes a job to essentially time out and a lot of tests to fail because the environment simply isn't capable of handling all of the load:
I can see that there are also qa-runner and qa-mirror-runner for gitlab-qa and its mirror project with the same tags, which means they would be picking up jobs as well. Can we bump those too, or disable them for now until fully migrated?
Also a question: is there any specific reason for using the n2d instance type, which is AMD if I'm not mistaken?
No particular reason right now. When we initially set the runner up, these were the only instance types available. And then when we added qa-runner and qa-mirror-runner, we just imitated the existing config so as not to have too many surprises. If someone is up for doing a comparison, I can help with spinning up runners using the newer instance types and we can look at changing them.
@balasankarc Could you please bump the instance type to n2d-standard-4 for the other 2 runners, especially the mirror one?
I can see that we now have some of the jobs running on updated instances, but quite a lot still aren't. This did provide additional interesting feedback, as I can now see that the failed jobs are mostly concentrated on the smaller, not-yet-updated instances.
@acunskis Sorry for the delay - got caught up in a production incident. I've updated the other two runners, and the new machines that spin up will be of the new instance type.
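In case anyone needs to repeat this, the change amounts to editing the google driver's machine type in the runner manager's config (the old value in the sed below is a guess; gitlab-runner reloads config.toml automatically, and already-provisioned machines keep the old type until they are rotated out):

```shell
# Bump the machine type requested for newly auto-scaled VMs
sudo sed -i 's/google-machine-type=n2d-standard-2/google-machine-type=n2d-standard-4/' \
  /etc/gitlab-runner/config.toml
```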
@acunskis The MaxBuilds value is 10, which means each machine is used for 10 jobs and is only cleaned up after that. Maybe some of them still haven't reached that threshold?
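The relevant settings live under [runners.machine] in the manager's config.toml; checking them looks like this (the values in the example output, other than MaxBuilds = 10, are illustrative):

```shell
grep -A 3 '\[runners.machine\]' /etc/gitlab-runner/config.toml
# Example output (illustrative values apart from MaxBuilds):
#   [runners.machine]
#     MaxBuilds = 10    # a machine is removed only after it has run 10 jobs
#     IdleCount = 2     # number of idle machines kept ready
#     IdleTime = 1800   # seconds before an idle machine is removed
```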