Changing autoscaler policy causes runners to stop accepting jobs
Summary
A GitLab Premium customer reported (Internal ZD Ticket) that their GitLab Runners stopped accepting new jobs. On closer observation, the runner was reporting as offline to GitLab but still appeared to finish its currently running jobs. Support identified that this started shortly after an autoscaler policy change. We've found that changing the customer's autoscaler policy causes runners to fail unexpectedly, which prevents new jobs from being picked up.
GitLab Runner appears to be stuck in a loop: it neither picks up new jobs nor adjusts to the policy change. The following error was found in the collected runner trace logs:
executor: reserving taskscaler capacity: no capacity: no immediately available capacity
The customer pointed out that this error comes from this section of the code. This seems to indicate that capacity isn't being handled correctly.
Steps to reproduce
The customer reported that they had two runner autoscaler policies, one for weekends and one for weekdays. During the switch from the weekend policy to the weekday policy, they started experiencing the behavior noted above.
The weekend policy has idle_count=0 and the weekday policy has idle_count=6.
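For reference, a policy pair like the one described above might look roughly like this in `config.toml`. This is a sketch, not the customer's actual configuration: only the two `idle_count` values come from the report. The `periods` expressions, the `idle_time` values, and the surrounding layout are assumptions for illustration.

```toml
# Hypothetical reproduction config. Only the idle_count values are taken
# from the report; everything else is an assumed placeholder.
[[runners]]
  executor = "docker-autoscaler"

  [runners.autoscaler]
    # plugin, plugin_config, and sizing settings omitted

    # Weekend policy (Saturday and Sunday): no idle capacity is kept.
    [[runners.autoscaler.policy]]
      periods = ["* * * * 0,6"]  # assumed schedule
      idle_count = 0
      idle_time = "20m0s"        # assumed value

    # Weekday policy (Monday to Friday): idle capacity of 6 is kept.
    [[runners.autoscaler.policy]]
      periods = ["* * * * 1-5"]  # assumed schedule
      idle_count = 6
      idle_time = "20m0s"        # assumed value
```

With a configuration along these lines, a reproduction attempt would let the runner cross the boundary from the weekend period into the weekday period and check whether it keeps requesting jobs afterwards.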
Actual behavior
When the autoscaler policy changes, GitLab Runner stops accepting new jobs.
Expected behavior
GitLab Runner continues to accept and run jobs as expected, regardless of autoscaler policy changes.
Relevant logs and/or screenshots
12705 19:36:32.921398 write(2<UNIX-STREAM:[64307->64310]>, "\33[37;1mFailed to process runner                          \33[0;m  \33[37;1mbuilds\33[0;m=0 \33[37;1merror\33[0;m=failed to update executor: reserving taskscaler capacity: no capacity: no immediately available capacity \33[37;1mexecutor\33[0;m=docker-autoscaler \33[37;1mmax_builds\33[0;m=120 \33[37;1mrunner\33[0;m=jHtzMyjDk\n", 303 <unfinished ...>
Environment description
GitLab: Self-managed Premium 16.7.4 (Omnibus)
Executor: Docker Autoscaler on AWS
Used GitLab Runner version
Runner: 17.5.2
Version: 17.5.2
Git revision: c6eae8d7
Git branch: 17-5-stable
GO version: go1.22.7
Possible Fixes
The customer highlighted these two sections of the code:
https://gitlab.com/gitlab-org/fleeting/taskscaler/-/blob/main/taskscaler.go#L448
Implementation
- {placeholder for implementation plan}