AWS fleeting plugin tries to use machine before it has fully booted
Summary
The AWS fleeting plugin (beta 0.4) tries to use the EC2 instance before it has fully booted. This is a problem for instances whose user-data script takes a few minutes to run, for example to install prerequisites that are not in the AMI or to format SSDs.
Steps to reproduce
- Follow the instructions to set up autoscaling (https://gitlab.com/gitlab-org/fleeting/fleeting-plugin-aws and https://docs.gitlab.com/runner/executors/docker_autoscaler.html)
- In the launch template for your Auto Scaling group, add a user-data script that takes 2-3 minutes to run (in our case, it installs Docker, adds ec2-user to the docker group, formats the SSD, and mounts it at /var/lib/docker)
- Create a CI job that runs inside a Docker container on your new runner
- This should start a new EC2 instance.
Actual behavior
The job runs (and in our case fails) before the EC2 instance has finished booting. If the job is re-run on the same autoscaled instance, it often succeeds because boot has completed by that point.
Expected behavior
The job should not start until the EC2 instance has fully booted (for example, by waiting for /var/lib/cloud/instance/boot-finished or by using cloud signals). It would then complete successfully.
Environment description
OS: Amazon Linux 2
Used GitLab Runner version
Version: 16.6.1
Git revision: f5da3c5a
Git branch: 16-6-stable
GO version: go1.20.10
Built: 2023-11-24T21:11:36+0000
OS/Arch: linux/arm64
Possible fixes
- Modify the fleeting plugin so that it waits for /var/lib/cloud/instance/boot-finished, or waits on a cloud signal, before treating the instance as "ready".
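To illustrate the first option, here is a minimal sketch of the kind of readiness check I have in mind, written in Python purely for illustration. It is not the plugin's code: the function name `wait_for_boot_finished` and the timeout and interval defaults are my own assumptions. It also has to run on the instance itself (how the plugin would invoke it, for example over its existing connection to the instance, is not shown).

```python
import os
import time

# Marker file that cloud-init writes once its final boot stage,
# which includes user-data scripts, has finished.
BOOT_FINISHED = "/var/lib/cloud/instance/boot-finished"


def wait_for_boot_finished(path=BOOT_FINISHED, timeout=600.0, interval=5.0):
    """Poll until `path` exists.

    Returns True as soon as the file is present, or False if `timeout`
    seconds pass first. The name and default values are hypothetical.
    """
    deadline = time.monotonic() + timeout
    while True:
        if os.path.exists(path):
            return True
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return False
        # Never sleep past the deadline.
        time.sleep(min(interval, remaining))


if __name__ == "__main__":
    ready = wait_for_boot_finished()
    print("instance ready" if ready else "timed out waiting for boot to finish")
```

With a check like this, the instance would only be handed to the runner once the function returns True. A timeout should probably be treated as a failed instance rather than as a ready one.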
Edited by Chris Pringle