Decided to move the runner manager used for gitlab-qa runs in package-and-build pipelines to a folder owned by Quality
A new GCP project named gitlab-qa-runners was created for this purpose
Will create a new VM in this project and get gitlab-runner installed in it. Auto-scaled VMs using docker-machine will also be created in this project.
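For reference, creating the manager VM in that project would look roughly like this (a sketch only; the VM name, zone, machine type and image are placeholders, not necessarily what was actually used):

```shell
# Hypothetical name/zone/machine-type/image; only the project name
# (gitlab-qa-runners) comes from this issue.
gcloud compute instances create qa-runner-manager \
  --project=gitlab-qa-runners \
  --zone=us-east1-c \
  --machine-type=n1-standard-2 \
  --image-family=ubuntu-1804-lts \
  --image-project=ubuntu-os-cloud
```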
QA jobs that require higher computing power (anchor high_capacity in the .gitlab-ci.yml file) are managed by the runner build-trigger-runner-manager-gitlab-org. This was set up when QA jobs from CE, EE and omnibus-gitlab were run in the omnibus-gitlab pipeline. However, that situation has now changed: the jobs now run in the QA pipeline itself and are just triggered from the omnibus-gitlab pipeline after the package and docker jobs.
The ownership/maintenance of this runner manager was initially with the Distribution team. Since it is now used only by QA pipelines, I think it is right to hand over ownership to the Quality team.
PS: For completeness, here is some additional info
QA jobs that don't require higher computing power are handled by shared auto-scaling runners.
Package build and Docker build jobs that are part of the triggered pipeline in omnibus-gitlab are managed by another runner - triggered-builds-runner-manager-gcp (previously build-trigger-runners-manager-gcp)
Well, I am not sure what exactly is actionable here. A section in https://about.gitlab.com/handbook/engineering/quality/ regarding the infrastructure used by the Quality team, maybe? @meks I am assigning this to you, since you can decide/delegate appropriately.
The situation did change, but we still use machines from this runner manager to build packages, so instead of just handing over the keys to the setup we should probably split the use cases. Basically, the current manager boots up machines for triggered builds and also for QA builds. The latter we can remove from the manager and spin off into a new manager maintained by Quality.
@ddavison If you are the one to hand this over to, I am happy to guide you through it. You will need to complete a couple of items first before we can move forward:
I suggest creating a Quality folder in GCP and connecting it to our billing. There you would ideally create a project that would hold the manager machine (a rough gcloud sketch follows below). Example of how I set this up for Distribution in GCP
Access to chef-repo so that you can manage the infrastructure
For both of those, you'll need to request access from the infra team.
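For illustration, the folder/project/billing setup could be done with gcloud along these lines (a sketch; the organization ID, folder ID, project name and billing account ID below are placeholders):

```shell
# Create a Quality folder under the organization (ORG_ID is a placeholder)
gcloud resource-manager folders create \
  --display-name="Quality" \
  --organization=ORG_ID

# Create a project inside that folder to hold the runner manager machine
# (FOLDER_ID and the project name are placeholders)
gcloud projects create quality-runner-manager --folder=FOLDER_ID

# Link the project to the existing billing account
gcloud beta billing projects link quality-runner-manager \
  --billing-account=BILLING_ACCOUNT_ID
```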
I'm unsure if we should try to target this for %12.4 or %12.5. I'd like to get the multi-project pipeline for our triggers to simplify the current state and dogfood the feature, but I'm unsure what we'd need to prioritize it over.
@kwiebers That's probably just it (@balasankarc please correct me if I'm wrong), but I guess that involves a few access requests back and forth etc. I think %12.5 is fine here.
@markglenfletcher I was wondering if we should also move the gitlab-insights-runner dedicated to https://gitlab.com/gitlab-org/gitlab-insights to this centralized place? Moreover, it seems this runner isn't documented anywhere at the moment.
@kwiebers @rymai If I am given temporary access to the QA folder in GCP, I am happy to create a new runner manager machine there and do what is necessary to move the runner over.
@rymai @kwiebers Do we want the runner manager VM (where we install gitlab-runner and docker-machine) and the auto-scaled-by-docker-machine VMs (where the CI jobs actually run) to be created inside the existing gitlab-qa-resources project? Or do you think a new project gitlab-qa-runners under the GitLab QA Projects folder is better, so we can isolate everything QA-runner related in this project?
Update: A new project gitlab-qa-runners was created, and I was given access to it. Next step: Spin up a machine that is managed by chef-repo, get runner and docker-machine installed.
I decided to unblock myself and get the machine created manually and later Chef-ify it.
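Roughly what the manual setup on the new VM involves, assuming a Debian/Ubuntu image (the docker-machine version below is only an example):

```shell
# Install gitlab-runner from the official package repository
curl -L "https://packages.gitlab.com/install/repositories/runner/gitlab-runner/script.deb.sh" | sudo bash
sudo apt-get install -y gitlab-runner

# Install docker-machine, which the docker+machine executor uses to create
# the auto-scaled build VMs (version pinned here only as an example)
curl -L "https://github.com/docker/machine/releases/download/v0.16.2/docker-machine-$(uname -s)-$(uname -m)" \
  -o /tmp/docker-machine
sudo install /tmp/docker-machine /usr/local/bin/docker-machine
```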
@kwiebers I would need the Runner registration token from the CI/CD settings of https://gitlab.com/gitlab-org/gitlab-qa-mirror. Could you either send it to me via 1Password, or make me a Maintainer of that project with an expiration date of 7 days so I can do whatever is necessary (the project is a non-release mirror, so I don't think an access request will be required).
@balasankarc I've set you as a maintainer of gitlab-org/gitlab-qa-mirror, with the access expiring on 2021-10-30. Let me know if you ever need to extend it. Thanks for taking care of this!
@mlapierre reminded me that the existing runner build-trigger-runner-manager-gitlab-org is also used in gitlab-qa, which means we need to register a new runner to replace that. I will do that once I get the GCP project and current VM managed by Infra's terraform setup.
Registered a runner named qa-runner for gitlab-qa project.
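For reference, the registration looks roughly like this (the description, executor and default image shown are illustrative; TOKEN is the registration token from the project's CI/CD settings):

```shell
sudo gitlab-runner register \
  --non-interactive \
  --url "https://gitlab.com/" \
  --registration-token "$TOKEN" \
  --description "qa-runner" \
  --executor "docker+machine" \
  --docker-image "docker:stable"
```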
Also, created an account for @rymai on the VM to avoid me being a single point of failure. Once all of this moves to official infra, we can set up authentication and access properly and get myself removed.
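For completeness, adding such an account on the VM amounts to roughly the following (the username and public key path are placeholders):

```shell
# Create the user, then install their SSH public key
sudo useradd --create-home --shell /bin/bash rymai
sudo mkdir -p /home/rymai/.ssh
sudo cp /tmp/rymai.pub /home/rymai/.ssh/authorized_keys
sudo chown -R rymai:rymai /home/rymai/.ssh
sudo chmod 700 /home/rymai/.ssh
sudo chmod 600 /home/rymai/.ssh/authorized_keys
```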
I went ahead and created an epic &6989 (closed) to house all these issues (this one, #651 (closed), anything else that might pop up). This is a child epic of &5933, and is specifically targeted at the infrastructure maintenance of these runners.
@balasankarc Is it possible to bump the instance type used for the qa-runner?
I have been investigating some of the test flakiness and, after adding some monitoring of resource consumption, found that test runs keep hitting the CPU ceiling fairly constantly. I ran some comparisons on runners with a larger vCPU count and found that the environment can require quite a bit more CPU during peak loads.
current run constantly hitting the 2 vCPU ceiling:
unrestricted run can utilize 4 cores and median hovers around 3:
Some of the configurations that spawn more containers can also use a bit more memory. Runs with praefect can sometimes fail with out-of-memory errors.
Here is an example where it causes a job to essentially time out and a lot of tests to fail because the environment simply isn't capable of handling all of the load:
I can see that there are also qa-runner and qa-mirror-runner for gitlab-qa and its mirror project with the same tags, which means they would be picking up jobs as well. Can we bump those too, or disable them for now until fully migrated?
Also a question: is there any specific reason for using the n2d instance type, which is AMD if I'm not mistaken?
No particular reason right now. When we initially set the runner up, these were the only instance types available. And then when we added qa-runner and qa-mirror-runner, we just imitated the existing config so as not to have too many surprises. If someone is up for doing a comparison, I can help with spinning up runners using the newer instance types and we can look at changing them.
@balasankarc Could you please bump the instance type to n2d-standard-4 for the other 2 runners, especially the mirror one?
I can see that we now have some of the jobs running on updated instances, but quite a lot still aren't. This did provide additional interesting feedback, as I can now see that the failed jobs are mostly concentrated on the smaller, not-yet-updated instances.
@acunskis Sorry for the delay - got caught up in a production incident. I've updated the other two runners, and the new machines that spin up will be of the new instance type.
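In case anyone needs to repeat this, the change amounts to editing the google driver's machine type in the runner manager's config (the old value in the sed below is a guess; gitlab-runner reloads config.toml automatically, and already-provisioned machines keep the old type until they are rotated out):

```shell
# Bump the machine type requested for newly auto-scaled VMs
sudo sed -i 's/google-machine-type=n2d-standard-2/google-machine-type=n2d-standard-4/' \
  /etc/gitlab-runner/config.toml
```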
@acunskis The MaxBuilds value is 10, which means each machine is used for 10 jobs and is only cleaned up after that. Maybe some of them still haven't reached that threshold?
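The relevant settings live under [runners.machine] in the manager's config.toml; checking them looks like this (the values in the example output, other than MaxBuilds = 10, are illustrative):

```shell
grep -A 3 '\[runners.machine\]' /etc/gitlab-runner/config.toml
# Example output (illustrative values apart from MaxBuilds):
#   [runners.machine]
#     MaxBuilds = 10    # a machine is removed only after it has run 10 jobs
#     IdleCount = 2     # number of idle machines kept ready
#     IdleTime = 1800   # seconds before an idle machine is removed
```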