Fix nesting installation

Tomasz Maczukin requested to merge fix-nesting-installation into master

Part of gitlab-org/ci-cd/shared-runners/infrastructure#18 (closed)

The existing installation script messes up the file ownership of the nesting binary and the nesting.plist configuration file for LaunchD. The files are used by root, so they should be owned by root, but they are owned by the ec2-user user. This prevents the nesting service from starting and makes the instance unusable.
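For anyone who wants to verify an instance by hand, here is a minimal sketch of the ownership check implied above. It is not part of the installation script in this MR. The function name `ownership_problems` is made up for illustration, and the group/other write check is an assumption based on how launchd generally treats daemon plists, not something stated in this MR.

```python
import os
import stat
import sys


def ownership_problems(path, expected_uid=0):
    """Return a list of human-readable ownership/mode problems for `path`.

    expected_uid defaults to 0 (root), which is what the files should be owned by.
    """
    info = os.stat(path)
    problems = []
    if info.st_uid != expected_uid:
        problems.append(
            f"{path}: owned by uid {info.st_uid}, expected uid {expected_uid}"
        )
    # Assumption: files writable by group or others are also rejected.
    if info.st_mode & (stat.S_IWGRP | stat.S_IWOTH):
        problems.append(f"{path}: writable by group or others")
    return problems


if __name__ == "__main__":
    # Paths are passed on the command line, so nothing is hard-coded here.
    for candidate in sys.argv[1:]:
        for problem in ownership_problems(candidate):
            print(problem)
```

Run it with the paths of the nesting binary and nesting.plist as arguments. No output means the files are owned by root and not writable by group or others.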

The problem was discovered by observing the metrics and noticing constant fluctuations in instance creation/deletion events, as well as the number of available tasks dropping to 0:

Screenshot_2023-03-27_at_18-10-45_ci-runners_Incident_Support_autoscaling-new_-CI_Runners-Dashboards-_Grafana

Screenshot_2023-03-27_at_18-11-12_ci-runners_Incident_Support_autoscaling-new_-CI_Runners-Dashboards-_Grafana

The problem started on 2023-03-26 around 12:30 UTC. This is when - I guess - the autoscaling group was asked to create a new instance and took the new AMI definition into account. We updated that AMI on 2023-03-23 (👉 https://ops.gitlab.net/gitlab-com/gl-infra/config-mgmt/-/merge_requests/5304). This also shows how long the existing instances were able to handle the small test load that we generate at the moment 😉

As the new AMI doesn't start nesting because of the problem described above, instance initialization was failing. In the logs, we can see repeated entries like:

Mar 27 15:55:44 runners-manager-saas-macos-staging-blue-1 gitlab-runner[111278]: 2023-03-27T15:55:44.753Z [INFO]  instance discovery: group=aws/us-east-1/r-saas-m-staging-blue-1 id=i-0486d9b83b32babb1 state=creating cause=requested
Mar 27 15:55:50 runners-manager-saas-macos-staging-blue-1 gitlab-runner[111278]: 2023-03-27T15:55:50.733Z [INFO]  instance update: group=aws/us-east-1/r-saas-m-staging-blue-1 id=i-0486d9b83b32babb1 state=running
Mar 27 16:01:19 runners-manager-saas-macos-staging-blue-1 gitlab-runner[111278]: 2023-03-27T16:01:19.082Z [ERROR] ready up preparation failed: instance=i-0486d9b83b32babb1 err="initializing nesting: rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing ssh: rejected: connect failed (open failed)\""
Mar 27 16:01:20 runners-manager-saas-macos-staging-blue-1 gitlab-runner[111278]: 2023-03-27T16:01:20.146Z [INFO]  instance update: group=aws/us-east-1/r-saas-m-staging-blue-1 id=i-0486d9b83b32babb1 state=deleting
Mar 27 16:09:04 runners-manager-saas-macos-staging-blue-1 gitlab-runner[111278]: 2023-03-27T16:09:04.544Z [INFO]  instance pruned: group=aws/us-east-1/r-saas-m-staging-blue-1 id=i-0486d9b83b32babb1 lifetime=13m19.791388225s

which matches the fluctuations observed in the metrics.
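To spot this pattern without reading the log by eye, a rough sketch like the one below could pull out the instances that failed initialization and how long each failure took. The regular expressions are assumptions based only on the entries quoted above, not on any documented gitlab-runner log format. The sample lines are copied from that excerpt with the syslog prefix trimmed.

```python
import re
from datetime import datetime

# Assumed layout: "<ISO timestamp>Z [LEVEL]  message", as in the excerpt above.
LOG_LINE = re.compile(
    r"(?P<ts>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{3})Z\s+"
    r"\[(?P<level>\w+)\]\s+(?P<msg>.*)"
)
# The instance appears as "id=i-..." in INFO lines and "instance=i-..." in the ERROR line.
INSTANCE_ID = re.compile(r"\b(?:id|instance)=(?P<id>i-[0-9a-f]+)")

SAMPLE = r'''
2023-03-27T15:55:44.753Z [INFO]  instance discovery: group=aws/us-east-1/r-saas-m-staging-blue-1 id=i-0486d9b83b32babb1 state=creating cause=requested
2023-03-27T16:01:19.082Z [ERROR] ready up preparation failed: instance=i-0486d9b83b32babb1 err="initializing nesting: rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing ssh: rejected: connect failed (open failed)\""
2023-03-27T16:09:04.544Z [INFO]  instance pruned: group=aws/us-east-1/r-saas-m-staging-blue-1 id=i-0486d9b83b32babb1 lifetime=13m19.791388225s
'''


def parse_events(lines):
    """Return (timestamp, level, instance_id, message) for every matching line."""
    events = []
    for line in lines:
        match = LOG_LINE.search(line)
        if not match:
            continue
        id_match = INSTANCE_ID.search(match["msg"])
        if not id_match:
            continue
        ts = datetime.strptime(match["ts"], "%Y-%m-%dT%H:%M:%S.%f")
        events.append((ts, match["level"], id_match["id"], match["msg"]))
    return events


def failed_initializations(events):
    """Map instance id to seconds between its first event and its first ERROR."""
    first_seen = {}
    failures = {}
    for ts, level, instance_id, _msg in events:
        first_seen.setdefault(instance_id, ts)
        if level == "ERROR" and instance_id not in failures:
            failures[instance_id] = (ts - first_seen[instance_id]).total_seconds()
    return failures


if __name__ == "__main__":
    print(failed_initializations(parse_events(SAMPLE.splitlines())))
```

For the sample above, this reports instance i-0486d9b83b32babb1 failing roughly five and a half minutes after it was requested.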

After manually creating an instance from the template used by our autoscaling group and logging into it, I was able to confirm that nesting is not running. When I tried to load /Library/LaunchDaemons/nesting.plist, I got an error message saying that the permissions are wrong. This is what brought us to this MR 🙂

Edited by Tomasz Maczukin