Orphaned fleeting plugin processes are sometimes left running when a runner is stopped, causing unexpected scaling events and job failures

Summary

A customer recently reported an issue with docker-autoscaler AWS EC2 instances being removed unexpectedly, causing jobs to fail (internal issue https://gitlab.com/gitlab-com/request-for-help/-/issues/2545).

Errors such as the following were seen in the runner log:

Mar 18 22:54:54 ip-10-50-14-53.pls.local gitlab-runner[604278]: ERROR: instance unexpectedly removed                instance=i-0dd3dc50e46215048 max-use-count=1 runner=t2_-65QKm slots=map[] subsystem=taskscaler used=0
Mar 18 22:55:54 ip-10-50-14-53.pls.local gitlab-runner[604278]: ERROR: out-of-sync capacity    

The ASG events showed that the instance had been removed due to a user request.

Investigation revealed that on the runner host in question there were multiple fleeting plugin processes running - one with a PPID of the active runner process, and the rest with a PPID of 1.
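The check described above can be sketched as a small script that lists processes and flags plugin processes whose parent is PID 1. This is only an illustration: the `fleeting-plugin` substring used to recognise plugin processes and the helper names are assumptions, not something taken from the runner code, and on hosts that use a subreaper an orphan may be reparented to a PID other than 1.

```python
import subprocess

# Assumption: plugin command lines contain this substring (e.g. a binary
# named like "fleeting-plugin-aws"). Adjust for the plugin actually in use.
PLUGIN_NAME_HINT = "fleeting-plugin"


def parse_ps(output):
    """Parse `ps -eo pid,ppid,args` output into (pid, ppid, args) tuples."""
    rows = []
    for line in output.splitlines():
        parts = line.split(None, 2)
        if len(parts) < 3:
            continue
        pid, ppid, args = parts
        # Skips the header row and anything else that is not numeric.
        if not (pid.isdigit() and ppid.isdigit()):
            continue
        rows.append((int(pid), int(ppid), args.rstrip()))
    return rows


def find_orphans(rows, hint=PLUGIN_NAME_HINT):
    """Return (pid, args) for plugin processes that were reparented to PID 1."""
    return [(pid, args) for pid, ppid, args in rows if ppid == 1 and hint in args]


def list_processes():
    """Run ps and return the parsed process table."""
    result = subprocess.run(
        ["ps", "-eo", "pid,ppid,args"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_ps(result.stdout)


if __name__ == "__main__":
    for pid, args in find_orphans(list_processes()):
        print(pid, args)
```

Any process it prints is a candidate orphan; the plugin process that belongs to the active runner has the runner's PID as its PPID and is not reported.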

It appears that prior restarts of the runner had left orphan plugin processes running, and these were causing capacity changes to be made in the ASG.

It is possible that this only occurs when a runner is being run in the foreground (e.g. via gitlab-runner --debug run) and is stopped by typing CTRL-C. The issue can certainly be reproduced this way.

Killing the extra plugin processes returned the runner to a stable state, and the problem stopped happening.

Steps to reproduce

Orphaned plugin processes can be created by running the runner in the foreground and then interrupting it with CTRL-C.

I do not know whether this can also occur when the runner service is stopped via gitlab-runner stop or gitlab-runner restart.

Actual behavior

Stopping/restarting the runner does not always stop the active fleeting plugin process.

Expected behavior

There should only be one fleeting plugin process (of the same type) running at a time, and it should be the one started by the runner process.

If necessary, runner startup should include a step that checks for and kills any orphaned plugin processes before starting new ones.
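As a rough illustration of what such a cleanup step could look like, the sketch below sends SIGTERM to a list of already-identified orphan PIDs, waits for a grace period, then sends SIGKILL to any that are still present. It is not how the runner is implemented; the function name, the grace period and the injectable `kill`/`sleep` parameters are all hypothetical, and a real implementation would also need to guard against PID reuse.

```python
import os
import signal
import time


def terminate_orphans(pids, grace_seconds=5.0, kill=os.kill, sleep=time.sleep):
    """Ask each PID to exit, then force-kill whatever is still running.

    Returns (signalled, force_killed): the PIDs that received SIGTERM and
    the subset that still existed after the grace period and got SIGKILL.
    `kill` and `sleep` are injectable so the logic can be exercised without
    touching real processes.
    """
    signalled = []
    for pid in pids:
        try:
            kill(pid, signal.SIGTERM)
            signalled.append(pid)
        except ProcessLookupError:
            pass  # process already gone, nothing to clean up

    # Give the plugins a chance to shut down cleanly.
    sleep(grace_seconds)

    force_killed = []
    for pid in signalled:
        try:
            kill(pid, signal.SIGKILL)
            force_killed.append(pid)
        except ProcessLookupError:
            pass  # exited during the grace period
    return signalled, force_killed
```

Trying SIGTERM first gives a plugin the opportunity to release whatever it holds before it is forcibly stopped.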

Relevant logs and/or screenshots

Environment description

Used GitLab Runner version

Possible fixes

Edited by Justin Farmiloe