Orphaned fleeting plugin processes sometimes left running when a runner is stopped, resulting in unexpected scaling events and job failures.
Summary
A customer recently reported an issue with docker-autoscaler AWS EC2 instances being removed unexpectedly, causing jobs to fail (internal issue https://gitlab.com/gitlab-com/request-for-help/-/issues/2545).
Errors such as the following were seen in the runner log:
Mar 18 22:54:54 ip-10-50-14-53.pls.local gitlab-runner[604278]: ERROR: instance unexpectedly removed instance=i-0dd3dc50e46215048 max-use-count=1 runner=t2_-65QKm slots=map[] subsystem=taskscaler used=0
Mar 18 22:55:54 ip-10-50-14-53.pls.local gitlab-runner[604278]: ERROR: out-of-sync capacity
The ASG events showed that the instance was removed due to a user request.
Investigation revealed that on the runner host in question there were multiple fleeting plugin processes running - one with a PPID of the active runner process, and the rest with a PPID of 1.
It appears that prior restarts of the runner had left orphan plugin processes running, and these were causing capacity changes to be made in the ASG.
It is possible that this only occurs when a runner is being run in the foreground (e.g. via gitlab-runner --debug run) and is stopped by typing CTRL-C. The issue can certainly be reproduced this way.
Killing the extra plugin processes returned the runner to a stable state, and the problem stopped happening.
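For anyone wanting to check a runner host for the same condition, the following is a minimal sketch of the check described above: list processes and flag plugin processes whose PPID is 1. It is not part of the runner. The plugin binary name `fleeting-plugin-aws` and the helper names are assumptions for illustration, and on hosts that use a subreaper an orphan may be re-parented to a PID other than 1.

```python
import os
import subprocess

# Assumed plugin binary name; adjust to the fleeting plugin actually in use.
PLUGIN_NAME = "fleeting-plugin-aws"


def parse_ps(ps_output):
    """Parse `ps -eo pid=,ppid=,args=` output into (pid, ppid, args) tuples."""
    procs = []
    for line in ps_output.splitlines():
        parts = line.split(None, 2)
        if len(parts) < 3:
            continue  # skip blank or malformed lines
        pid, ppid, args = parts
        if pid.isdigit() and ppid.isdigit():
            procs.append((int(pid), int(ppid), args))
    return procs


def find_orphaned_plugins(procs, name=PLUGIN_NAME):
    """Return plugin processes that have been re-parented to PID 1."""
    orphans = []
    for pid, ppid, args in procs:
        argv = args.split()
        if ppid == 1 and argv and os.path.basename(argv[0]) == name:
            orphans.append((pid, ppid, args))
    return orphans


def snapshot():
    """Collect the current process list from ps (Linux procps syntax)."""
    result = subprocess.run(
        ["ps", "-eo", "pid=,ppid=,args="],
        capture_output=True, text=True, check=True,
    )
    return parse_ps(result.stdout)
```

Calling `find_orphaned_plugins(snapshot())` on the runner host returns the candidate orphans. The plugin process whose PPID is the active runner process is not reported.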
Steps to reproduce
Orphaned plugin processes can be created by running the runner in the foreground and then interrupting it with CTRL-C.
I do not know whether this also occurs when stopping the runner service via gitlab-runner stop/restart.
Actual behavior
Stopping/restarting the runner does not always stop the active fleeting plugin process.
Expected behavior
There should only be one fleeting plugin process (of the same type) running at a time, and it should be the one started by the runner process.
If necessary, runner startup should include a step that checks for and kills any orphaned plugin processes before starting new ones.
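As a rough illustration of the kind of startup step suggested above (not how the runner works today, and not a proposed implementation), the sketch below sends SIGTERM to plugin processes that have been re-parented to PID 1. The function name, the `fleeting-plugin-aws` binary name, and the `(pid, ppid, args)` input shape are assumptions. A real fix would also need to avoid killing plugins that belong to another runner on the same host.

```python
import os
import signal

# Assumed plugin binary name; adjust to the fleeting plugin actually in use.
PLUGIN_NAME = "fleeting-plugin-aws"


def terminate_orphans(procs, name=PLUGIN_NAME, kill=os.kill):
    """Send SIGTERM to orphaned plugin processes.

    procs is an iterable of (pid, ppid, args) tuples, for example parsed
    from `ps -eo pid=,ppid=,args=`. Returns the PIDs that were signalled.
    """
    terminated = []
    for pid, ppid, args in procs:
        argv = args.split()
        is_plugin = bool(argv) and os.path.basename(argv[0]) == name
        if ppid != 1 or not is_plugin:
            continue  # still owned by a live parent, or not a plugin
        try:
            kill(pid, signal.SIGTERM)
        except ProcessLookupError:
            continue  # process exited between listing and signalling
        terminated.append(pid)
    return terminated
```

The `kill` parameter exists only so the logic can be exercised without signalling real processes.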