Fix nesting installation
Part of gitlab-org/ci-cd/shared-runners/infrastructure#18 (closed)
The existing installation script sets the wrong file ownership on the nesting binary and the nesting.plist configuration file for LaunchD. The files should be owned by root, but are owned by the ec2-user user. This prevents the nesting service from starting and makes the instance unusable.
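A quick way to spot this kind of ownership problem on an instance is sketched below. It is only an illustration, not part of this MR's change: the helper name wrongly_owned and the binary path /usr/local/bin/nesting are assumptions made for the example, and the check looks only at the file owner, not at the mode bits.

```python
import os


def wrongly_owned(paths, expected_uid=0):
    """Return the paths whose owner differs from expected_uid (0 is root)."""
    return [p for p in paths if os.stat(p).st_uid != expected_uid]


if __name__ == "__main__":
    # Assumed install locations: the binary path is hypothetical,
    # the plist is expected under /Library/LaunchDaemons.
    candidates = [
        "/usr/local/bin/nesting",
        "/Library/LaunchDaemons/nesting.plist",
    ]
    # Skip files that are not present so the sketch runs on any machine.
    existing = [p for p in candidates if os.path.exists(p)]
    for path in wrongly_owned(existing):
        print(f"{path} is not owned by root")
```

On an affected instance this would be expected to list both files, since they are owned by ec2-user instead of root.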
The problem was discovered by observing the metrics and noticing constant fluctuations in instance creation and deletion events, as well as the number of available tasks dropping to 0.
The problem started on 2023-03-26 around 12:30 UTC. This is when, I guess, the autoscaling group was asked to create a new instance and took the new AMI definition into account. That new AMI (which we've updated on 2023-03-23 (
As the new AMI doesn't start nesting because of the problem described above, instance initialization was failing. In the logs, we can see repeated entries like:
Mar 27 15:55:44 runners-manager-saas-macos-staging-blue-1 gitlab-runner[111278]: 2023-03-27T15:55:44.753Z [INFO] instance discovery: group=aws/us-east-1/r-saas-m-staging-blue-1 id=i-0486d9b83b32babb1 state=creating cause=requested
Mar 27 15:55:50 runners-manager-saas-macos-staging-blue-1 gitlab-runner[111278]: 2023-03-27T15:55:50.733Z [INFO] instance update: group=aws/us-east-1/r-saas-m-staging-blue-1 id=i-0486d9b83b32babb1 state=running
Mar 27 16:01:19 runners-manager-saas-macos-staging-blue-1 gitlab-runner[111278]: 2023-03-27T16:01:19.082Z [ERROR] ready up preparation failed: instance=i-0486d9b83b32babb1 err="initializing nesting: rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing ssh: rejected: connect failed (open failed)\""
Mar 27 16:01:20 runners-manager-saas-macos-staging-blue-1 gitlab-runner[111278]: 2023-03-27T16:01:20.146Z [INFO] instance update: group=aws/us-east-1/r-saas-m-staging-blue-1 id=i-0486d9b83b32babb1 state=deleting
Mar 27 16:09:04 runners-manager-saas-macos-staging-blue-1 gitlab-runner[111278]: 2023-03-27T16:09:04.544Z [INFO] instance pruned: group=aws/us-east-1/r-saas-m-staging-blue-1 id=i-0486d9b83b32babb1 lifetime=13m19.791388225s
which matches the fluctuations observed in the metrics.
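To follow one instance through log entries like the ones above, the lines can be grouped by instance ID. The sketch below is a hypothetical helper written for this description, not something the runner provides; the regular expressions are assumptions derived only from the five sample lines and may not cover other log formats.

```python
import re
from collections import defaultdict

# The runner's own timestamp, level and event name, e.g.
# "2023-03-27T15:55:44.753Z [INFO] instance discovery: ..."
ENTRY = re.compile(
    r"(?P<ts>\d{4}-\d{2}-\d{2}T[\d:.]+Z) \[(?P<level>[A-Z]+)\] "
    r"(?P<event>[^:]+): (?P<rest>.*)"
)
# The sample lines use both "id=" and "instance=" for the EC2 instance ID.
INSTANCE_ID = re.compile(r"\b(?:id|instance)=(i-[0-9a-f]+)")
STATE = re.compile(r"\bstate=(\w+)")


def instance_timeline(lines):
    """Group log entries by instance ID as (timestamp, level, label) tuples.

    The label is the reported state when the entry has one,
    otherwise the event name.
    """
    timeline = defaultdict(list)
    for line in lines:
        entry = ENTRY.search(line)
        if not entry:
            continue
        instance = INSTANCE_ID.search(entry["rest"])
        if not instance:
            continue
        state = STATE.search(entry["rest"])
        label = state.group(1) if state else entry["event"]
        timeline[instance.group(1)].append((entry["ts"], entry["level"], label))
    return dict(timeline)
```

Applied to the five entries above, this yields the sequence creating, running, ready up preparation failed, deleting, instance pruned for i-0486d9b83b32babb1, which is the create/fail/delete cycle seen in the metrics.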
After creating an instance manually from the template used by our autoscaling group and logging into it, I was able to confirm that nesting is not running. When trying to load /Library/LaunchDaemons/nesting.plist, I got an error message saying that the permissions are wrong. This is what brought us to this MR.