Occasional macOS job failures timing out on "Dialing nesting daemon"
In the saas-macos-medium-m1-basic-tests project I have a scheduled pipeline running every other hour.
We are getting occasional job failures, like so:
- https://gitlab.com/gitlab-org/ci-cd/tests/saas-runners-tests/macos-platform/saas-macos-medium-m1-basic-tests/-/jobs/8954071551
- https://gitlab.com/gitlab-org/ci-cd/tests/saas-runners-tests/macos-platform/saas-macos-medium-m1-basic-tests/-/jobs/8952695364
- https://gitlab.com/gitlab-org/ci-cd/tests/saas-runners-tests/macos-platform/saas-macos-medium-m1-basic-tests/-/jobs/8948890215
- https://gitlab.com/gitlab-org/ci-cd/tests/saas-runners-tests/macos-platform/saas-macos-medium-m1-basic-tests/-/jobs/8941796715
This currently affects only 1% of all jobs.
All of these failures are in the form of:
Running with gitlab-runner 17.7.0~pre.103.g896916a8 (896916a8)
on blue-2.saas-macos-medium-m1.runners-manager.gitlab.com/default tCo7q1eW, system ID: s_f10a77c3f6e9
Resolving secrets
Preparing the "instance" executor 01:00:00
Preparing instance...
Dialing instance i-06ba8edc53bb73918...
Instance i-06ba8edc53bb73918 connected
Enforcing VM Isolation
Creating nesting VM tunnel
Creating nesting VM macos-14-xcode-15
Dialing nesting daemon
ERROR: Job failed: execution took longer than 1h0m0s seconds
These failures build up on the macOS instances that have been active the most and for the longest period of time. This is a host degradation issue.
Most likely we can detect this repeated failure and delete the host, as suggested in Track instances health and remove unhealthy aft... (gitlab-org/fleeting/taskscaler#9)
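To make the detection idea concrete, here is a rough sketch of how the failure signature from the log above could be counted per instance. This is not the taskscaler API: `is_nesting_dial_timeout`, `InstanceHealth`, and the threshold of 3 are hypothetical names and values chosen for illustration, and the sketch assumes the job log is available as a list of lines.

```python
from collections import defaultdict

# Hypothetical threshold: consecutive signature failures before a host is flagged.
FAILURE_THRESHOLD = 3

# Signature taken from the failing job logs quoted in this issue.
DIAL_STEP = "Dialing nesting daemon"
TIMEOUT_PREFIX = "ERROR: Job failed: execution took longer than"


def is_nesting_dial_timeout(log_lines):
    """Return True if the log ends with the dial step followed by the timeout error."""
    lines = [line.strip() for line in log_lines if line.strip()]
    if len(lines) < 2:
        return False
    return lines[-2] == DIAL_STEP and lines[-1].startswith(TIMEOUT_PREFIX)


class InstanceHealth:
    """Tracks consecutive nesting-dial timeouts per instance ID (illustrative only)."""

    def __init__(self, threshold=FAILURE_THRESHOLD):
        self.threshold = threshold
        self.consecutive_failures = defaultdict(int)

    def record(self, instance_id, log_lines):
        """Record one job result; return True if the instance should be removed."""
        if is_nesting_dial_timeout(log_lines):
            self.consecutive_failures[instance_id] += 1
        else:
            # Any job without the signature resets the streak for that instance.
            self.consecutive_failures[instance_id] = 0
        return self.consecutive_failures[instance_id] >= self.threshold
```

A caller would feed each finished job's log into `record` together with the instance ID (for example `i-06ba8edc53bb73918`) and delete the host once it returns `True`. Counting consecutive failures rather than total failures is a design assumption here, meant to avoid removing a host over a single isolated timeout.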