Fleeting plugins problem with `SIGQUIT`
Fleeting plugins exit too soon when `SIGQUIT` is used to initiate Runner's graceful shutdown mechanism. In most cases we want to terminate the runner gracefully; there are multiple reasons for that and almost no reason not to care about it.
However, it seems that systemd sends the `SIGQUIT` signal not just to the runner process but to the whole process group, and that includes the plugin. The plugin itself terminates abruptly as soon as it receives `SIGQUIT`. A more detailed description, with some background and investigation of the problem, can be found at gitlab-org/gitlab-runner!3769 (comment 1337520213).
To summarize the problem:
- Our new autoscaling runner uses the taskscaler and fleeting libraries, and we use fleeting plugins to interact with cloud providers.
- Our runner managers run the `gitlab-runner` process through systemd (at least for now).
- Systemd is configured to send `SIGQUIT` when requested to stop the Runner process. This has a 3 hour timeout, after which systemd force-terminates the process by sending `SIGTERM` (which runner recognizes and tries to stop ASAP).
- Most probably (for now it's only my suspicion) systemd sends the signal to the whole process group and not just to the runner process.
- When the process group receives `SIGQUIT`, runner initiates graceful shutdown, but the fleeting plugin fails immediately.
- A non-problematic (but noisy) side effect of that plugin termination is a stack trace dump that we see in the logs.
- A very problematic side effect of that plugin termination is that it happens before fleeting and taskscaler can gracefully shut themselves down. With ASG not downscaling on Runner shutdown (#47 - closed) in mind, this means that we're unable to scale the ASG down before exiting the runner process. As described in #47, depending on the number of instances that were up when we initiated the runner shutdown, we will generate significant cost for totally wasted resources.
We need to find a way to fix that. This is important for keeping our operating costs at a sane level!
Off the top of my head, I see a few ways we could handle this. Please note that these are just raw ideas, without any validation of which one would be best or whether they even make sense:
- Handle `SIGQUIT` in our officially provided plugins and consciously ignore that signal.
  `SIGQUIT` has a special meaning in the GitLab Runner world, so this may create unwanted coupling between Runner and fleeting (which is intended to be a standalone library, useful for other tasks as well).
- Change how runner starts fleeting plugins, for example by forcing them to be placed in their own process group (we do a similar thing when starting shells for jobs executed in the shell and custom executors).
  This may require nasty hacking around the HashiCorp plugin system. It also has the side effect that the connection between the runner process and the plugin processes is weakened. Process managers like systemd send signals to process groups to avoid leaving orphaned processes. With this approach we may create a new problem while fixing the existing one.
- Reconfigure systemd to send `SIGQUIT` just to the runner process and not the whole process group, but still force-terminate the whole process group.
  I don't know if that's even possible.
- Create a wrapper around the runner process that will start runner as its own child process, but in a separate process group (read: not controlled directly by systemd), and will handle signals.
  This is basically an extension of the idea above: if we can't configure systemd to handle some signals one way and others another way, we can do that in a middleware that sits between systemd and runner.