Add in-AMI watchdog to periodically recover nesting from bad state
## Problem

The nesting daemon running in the macOS AMI periodically enters a bad state with "dead" VMs: VM directories that are stale and no longer associated with active jobs. When this happens, nesting continues to report these dead VMs as active, which prevents the host from being fully utilised and eventually requires manual intervention via SSH to run the recovery script.

See also: https://gitlab.com/gitlab-org/ci-cd/shared-runners/infrastructure/-/work_items/278

For the longer term, see the conversation about re-architecting nesting to use a process-per-VM model, which should fix this problem: https://gitlab.com/gitlab-org/fleeting/nesting/-/work_items/13#note_3228348505

## Proposed Solution

Add an in-AMI watchdog as a macOS LaunchDaemon that runs every 6 hours and safely recovers nesting when dead VMs are detected.

The watchdog follows the same logic as the existing `safe-reboot-nesting.sh` script (used for manual SSH-based recovery):

1. Find VM directories older than 4 hours (dead VMs). If none exist, exit; no action is needed.
2. Wait up to 3 hours for any live VMs (active jobs) to finish draining.
3. If live VMs have not drained after 3 hours, exit without restarting (retry on the next run).
4. Delete the nesting working directory contents and restart the nesting service via `launchctl kickstart -k system/nesting`.

**New files:**

- `assets/nesting-watchdog.sh`: the watchdog shell script
- `assets/nesting-watchdog.plist`: LaunchDaemon plist (`StartInterval: 21600`, `RunAtLoad: false`)

**Changed files:**

- `scripts/provisioner/20_install_nesting.sh`: updated to install both new files

This is a short-term mitigation. The root cause needs to be fixed in nesting itself.

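
The four steps above could fit together roughly as follows. This is a minimal sketch, not the contents of `assets/nesting-watchdog.sh`. The working directory path (`NESTING_DIR`), the use of directory modification time as the age test, and the 60-second poll interval are assumptions made for illustration. The 4-hour and 3-hour thresholds and the `launchctl kickstart -k system/nesting` command come from the description.

```shell
#!/bin/sh
# Sketch of the watchdog flow. NESTING_DIR and POLL_SEC are hypothetical.

NESTING_DIR="${NESTING_DIR:-/tmp/nesting-example}"  # hypothetical path
DEAD_AGE_MIN="${DEAD_AGE_MIN:-240}"                 # 4 hours, in minutes
DRAIN_TIMEOUT_SEC="${DRAIN_TIMEOUT_SEC:-10800}"     # 3 hours, in seconds
POLL_SEC="${POLL_SEC:-60}"                          # assumed poll interval

# Dead VMs: directories last modified more than DEAD_AGE_MIN minutes ago
# (assumption: age is judged by mtime).
dead_vms() {
  find "$NESTING_DIR" -mindepth 1 -maxdepth 1 -type d -mmin +"$DEAD_AGE_MIN"
}

# Live VMs: every other VM directory.
live_vms() {
  find "$NESTING_DIR" -mindepth 1 -maxdepth 1 -type d ! -mmin +"$DEAD_AGE_MIN"
}

# Poll until no live VMs remain; return 1 once the timeout is reached.
wait_for_drain() {
  waited=0
  while [ -n "$(live_vms)" ]; do
    if [ "$waited" -ge "$DRAIN_TIMEOUT_SEC" ]; then
      return 1
    fi
    sleep "$POLL_SEC"
    waited=$((waited + POLL_SEC))
  done
  return 0
}

# Restart command taken from the description.
restart_nesting() {
  launchctl kickstart -k system/nesting
}

# Remove the contents of the working directory (not the directory itself),
# then restart the service.
recover() {
  find "$NESTING_DIR" -mindepth 1 -maxdepth 1 -exec rm -rf {} +
  restart_nesting
}

watchdog() {
  if [ -z "$(dead_vms)" ]; then
    echo "no dead VMs; nothing to do"
    return 0
  fi
  if ! wait_for_drain; then
    echo "live VMs still present after drain timeout; will retry next run"
    return 0
  fi
  recover
}

# Only act when explicitly asked, so the file can be sourced safely.
if [ "${1:-}" = "--run" ]; then
  watchdog
fi
```

Keeping `restart_nesting` as its own function lets the flow be exercised against a scratch directory with the restart stubbed out, which is one way to cover the "fake dead VM" test case without touching the real service.
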
## Testing

- [x] Verify daemon launch in integration tests
- [x] Test script on machines with "fake" dead VMs to ensure the script runs successfully
- [x] Test rollout to staging, verifying the job runs on the 6-hour schedule and writes relevant logs
- [x] Simulate dead VM in staging, verifying the VMs are cleaned and nesting restarted
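
One way to stage the "fake" dead VM case is sketched below. It assumes the watchdog judges age by directory modification time. The directory name `fake-dead-vm` is hypothetical, and the example works in a throwaway temp directory rather than the real nesting working directory.

```shell
#!/bin/sh
# Create a directory that looks like a stale VM by backdating its mtime.
# Usage: make_fake_dead_vm <parent-dir> <name>
make_fake_dead_vm() {
  mkdir -p "$1/$2"
  # A fixed timestamp far in the past avoids GNU/BSD `date` differences.
  touch -t 202001010000 "$1/$2"
}

demo_dir="$(mktemp -d)"
make_fake_dead_vm "$demo_dir" fake-dead-vm

# The same age test a watchdog could apply: older than 4 hours (240 minutes).
find "$demo_dir" -mindepth 1 -maxdepth 1 -type d -mmin +240
```

The final `find` should list the backdated directory, confirming it would be picked up as dead under the 4-hour threshold.
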