Are there instance provisioning cases that the current autoscaler settings (instance_ready_command, runners.autoscaler.connector_config.timeout) don't cater for?

We have an instance_ready_command setting introduced ~~here~~ which is run before a job is assigned to an instance. If the command fails, the instance is terminated without interrupting the job, which waits for a new instance to become available.

The discussions around its implementation suggest it was introduced to address jobs failing after being assigned to instances that have SSH connectivity but have not completed their other startup tasks (e.g. installation of additional required software such as Docker). But the command currently appears to run only once, as soon as SSH connectivity is established, and the instance is terminated if the command fails. While this is acceptable for a previously working instance that for some reason encounters problems, it does not help with the prolonged startup issue.

It seems like this could be easily extended with timeout and/or retry settings that apply when an instance has yet to process a job. This would allow the runner to wait for instance startup to complete, for a period chosen to suit the startup configuration, without jobs failing.

[Description updated following discussion below]

We have an instance_ready_command setting introduced here which is run once before an instance is made available to accept jobs. If the command fails, the instance is terminated without interrupting the job, which waits for a new instance to become available.

This enhances the default process of just retrying an SSH connection to the instance for a configured period and failing the job if a connection cannot be established within the allowed time.

The command can include timeout/retry logic to enable it to wait for required services to become available.
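As a rough illustration of what such timeout/retry logic might look like, here is a minimal POSIX shell sketch. The function name wait_for, its arguments, and the idea of checking Docker are assumptions made for this example and are not taken from the runner documentation. The real check depends on what your jobs need.

```shell
#!/bin/sh
# Hypothetical helper (not part of GitLab Runner): retry a readiness check
# until it succeeds or a time budget is used up.
# Usage: wait_for TIMEOUT_SECONDS INTERVAL_SECONDS COMMAND [ARGS...]
wait_for() {
    timeout_s=$1
    interval_s=$2
    shift 2
    elapsed=0
    # Run the check repeatedly; stop as soon as it exits with status 0.
    until "$@" >/dev/null 2>&1; do
        # Give up once the accumulated sleep time reaches the budget.
        # (Time spent inside the check itself is not counted.)
        if [ "$elapsed" -ge "$timeout_s" ]; then
            return 1
        fi
        sleep "$interval_s"
        elapsed=$((elapsed + interval_s))
    done
    return 0
}
```

For example, assuming jobs need a running Docker daemon, a script containing this function could end with `wait_for 300 5 docker info`, so that the command only reports success once Docker responds, and reports failure (leading to the instance being terminated) if it is still unavailable after roughly five minutes.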

The runners.autoscaler.connector_config.timeout setting controls how long the runner waits for the SSH connection to become available to run the instance_ready_command.

So this would seem to allow for situations where:

- SSH connectivity is available early on in the instance provisioning process, but extra time is required for other services to be started before a job can be run
- SSH connectivity is available late in the instance provisioning process

Other options to work around long instance startup times are discussed here, including pre-installing all required software on the image, starting the SSH service last, and using AWS lifecycle hooks, but these all require changes to be made to resources outside of the runner configuration.

If there are scenarios that these settings do not cater for, please add details to this issue.

Edited Apr 30, 2025 by Justin Farmiloe
