
Taskscaler capacity scheduling is not updated when runner config autoscaling settings are adjusted

Overview

I have a runner lifecycle setup in which a new version of a given runner is rolled out and registered; once that runner has been vetted, it is rotated into service via tag adjustments. At that point the previous version of the runner is paused and its idle pool is scaled to zero. With fleeting/taskscaler, the mechanism by which I aim to achieve that is dropping idle_count to 0 and scale_factor to 0.0, and setting idle_time to "5s".
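Concretely, the scale-to-zero adjustment looks something like the following; the [[runners.autoscaler.policy]] section and key names here follow the runner autoscaler docs and may differ between runner versions:

```toml
# Policy applied to the retired runner version after rotation:
# no idle capacity, no scaling against in-use capacity, and
# reap instances after 5 seconds of idleness.
[[runners.autoscaler.policy]]
  idle_count   = 0
  idle_time    = "5s"
  scale_factor = 0.0
```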

For the docker-machine executor, a comparable approach works fine. With taskscaler, I've noticed that the config file watch works as intended, but when the runners.autoscaling_policy section(s) are updated, taskscaler's internal scheduling state is not refreshed to reflect those changes. The idle pool is never adjusted down to contain only the currently acquired instances, so it never scales to zero as those jobs complete and the instances are reaped.

Restarting the runner process does refresh the scheduling configuration, since everything is re-initialized, but that tends to break running jobs: taskscaler "forgets" which instances are acquired and in use (or were idle and tracked before the service restart) and then wipes out everything in the autoscaling group as pre-existing ("no data on pre-existing instance so removing for safety").

The current workaround is to make the scaling adjustments, watch for the running job count to hit zero, and then restart the runner service to terminate the idle pool, but that's not a great long-term solution.
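As an illustration, a minimal drain-then-restart sketch of that workaround might look like this. It assumes the runner's Prometheus metrics listener is enabled (listen_address in config.toml), that the gitlab_runner_jobs gauge is exposed on the default port, and that the service is managed by systemd; the URL, poll interval, and service name are placeholders:

```shell
#!/bin/sh
# Drain-then-restart helper for the scale-to-zero workaround (sketch).
METRICS_URL="${METRICS_URL:-http://localhost:9252/metrics}"

# Sum every gitlab_runner_jobs sample in a metrics dump read from stdin.
jobs_running() {
  awk '/^gitlab_runner_jobs[{ ]/ { total += $NF } END { print total + 0 }'
}

# Poll until no jobs remain, then restart the runner to kill the idle pool.
drain_and_restart() {
  until [ "$(curl -sf "$METRICS_URL" | jobs_running)" -eq 0 ]; do
    sleep 30
  done
  systemctl restart gitlab-runner
}

# Only act when invoked with "run", so sourcing the file has no side effects.
if [ "${1:-}" = "run" ]; then
  drain_and_restart
fi
```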

Proposal

  • {placeholder}
Edited by Darren Eastman