The ability to enable graceful shutdown on Windows is now available behind a feature flag. To enable, set the feature flag FF_USE_WINDOWS_LEGACY_PROCESS_STRATEGY to false.
Overview
Currently "graceful shutdown" (understood as Runner waiting for all running jobs to exit before the Runner process itself finishes) is supported only on Unix systems, where the SIGQUIT signal can be sent to the process. Windows has no such concept, so process termination currently triggers the force-shutdown strategy, in which all running jobs are interrupted.
We should start supporting "graceful shutdown" also for Windows.
Proposal
Introduce an optional way to shut down GitLab Runner using a three-interrupt strategy:
Graceful waiting for jobs to finish and not accepting new jobs.
First interrupt call (whether it's done by a signal in the Unix way or in the native Windows way): transfers the runner to the graceful shutdown state. As a reminder: no new jobs are requested and the runner waits for all already handled jobs to finish.
More forceful shutdown, ensuring all jobs are killed.
Second interrupt call: stops the graceful shutdown and forces the runner to immediately stop all jobs. GitLab Runner waits for the jobs to exit before shutting down.
Really forceful, kill Runner process and don't worry about jobs that might still be running.
Third interrupt call immediately stops runner process and ignores the running jobs.
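The three stages above can be sketched as a small state machine. This is only an illustration of the proposed ordering, not Runner's actual implementation: the `shutdownStage` type, its constants and `handleInterrupts` are hypothetical names, and interrupts are modeled as values on a plain channel rather than real OS signals.

```go
package main

import "fmt"

// shutdownStage is a hypothetical type naming the phases of the proposal.
type shutdownStage int

const (
	stageRunning  shutdownStage = iota // normal operation
	stageGraceful                      // 1st interrupt: no new jobs, wait for running ones
	stageForceful                      // 2nd interrupt: stop all jobs, wait for them to exit
	stageKilled                        // 3rd interrupt: exit at once, ignore running jobs
)

func (s shutdownStage) String() string {
	switch s {
	case stageRunning:
		return "running"
	case stageGraceful:
		return "graceful shutdown"
	case stageForceful:
		return "forceful shutdown"
	default:
		return "killed"
	}
}

// next returns the stage entered after one more interrupt.
func (s shutdownStage) next() shutdownStage {
	if s >= stageKilled {
		return stageKilled
	}
	return s + 1
}

// handleInterrupts consumes interrupts until the final stage is reached
// (or the channel is closed) and returns every stage it entered, in order.
func handleInterrupts(interrupts <-chan struct{}) []shutdownStage {
	stage := stageRunning
	var entered []shutdownStage
	for range interrupts {
		stage = stage.next()
		entered = append(entered, stage)
		if stage == stageKilled {
			break // the process would exit here; later interrupts are never seen
		}
	}
	return entered
}

func main() {
	interrupts := make(chan struct{}, 3)
	for i := 0; i < 3; i++ {
		interrupts <- struct{}{}
	}
	close(interrupts)

	for _, stage := range handleInterrupts(interrupts) {
		fmt.Println(stage)
	}
}
```

In a real process the channel would presumably be fed by `signal.Notify`, and each stage transition would trigger the corresponding job handling instead of only being recorded.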
We've talked with @steveazz about possible solutions (including my current PoC from !1688 (closed)).
In general, we think we should make the graceful shutdown strategy the default one, always executed on all platforms when Runner's process is interrupted. In our opinion there is no need for a special signal, as SIGQUIT is used now. The first interrupt signal should simply start the graceful shutdown, the second interrupt should switch to the forceful shutdown, and the third interrupt signal should immediately finish the process (as currently happens on the second interrupt). Calling gitlab-runner stop or interrupting a Runner process running in the foreground would then be handled consistently on all platforms. In our opinion, in 99% of cases the graceful shutdown is what the user wants, and in those specific cases where faster termination is preferred over finishing the jobs properly, one can always force-exit by using kill on Unix platforms or taskkill/Task Manager on Windows.
This is the ideal solution, but it's also a breaking change in Runner's behavior. It may affect users' integrations and deployment mechanisms. Therefore we propose the following plan:
In 12.6, as part of our ~"Shared Runners::Windows" MVC, we will add a way to emulate SIGQUIT usage on Windows (details below).
We will deprecate the old shutdown behavior and replace it in %13.0 (announcing this with deprecation notices beforehand).
In %13.0 we will implement the above strategy of using graceful shutdown always, as the main behavior but leaving the old one hidden behind a feature flag.
The feature flag will have its EOL set to, let's say, %13.6, after which the old behavior will be removed entirely.
As for emulating SIGQUIT sending on Windows, we've decided to create a mechanism that will look for a specific file. Let's say that the configuration file is present at c:\gitlab-runner\config.toml. We will add a mechanism that checks whether a file c:\gitlab-runner\config.toml.quit has been created. If it has, we will internally send the syscall.SIGQUIT value to the mr.stopSignals channel, emulating the behavior of sending SIGQUIT on Unix machines. To make sure that Runner will be able to clean up such a file and that its name is not misspelled, we will add a gitlab-runner graceful-stop command which will create the proper file in the proper place.
Both solutions will be compiled only for Windows and both should clearly log that they are temporary solutions that will be removed with the changes implemented in %13.0.
They chose to fork the runner to deal with this as a workaround, and it is now hindering them from upgrading GitLab. Having a solid and trusted solution will greatly help them make managing developer tools easier, and their Windows pipelines more reliable.
They chose to fork the runner to deal with this as a workaround
@pharlan I'm curious what their workaround is. Are they able to share their code with us, and how is it working for them? It might help us while we are still developing this feature.
Customer uses Windows runners extensively and therefore cannot use SIGQUIT directly, so an emulation path or any other way to shut down gracefully would be a significant improvement.
@DarrenEastman - I just spoke again with https://gitlab.my.salesforce.com/00161000004bZxf and learned that they are still forced to update their forked version of their runners each time they want to upgrade their GitLab instance. It is a burden that causes significant delays in upgrades and thus delays around adoption of the glories that GitLab keeps providing. Would it be possible to get some additional insight into the plans around resolving these issues or strategies for alternate solutions? This had been targeted for 12.7-13.3 and now taken out and put in the backlog. Any update would be most appreciated.
@oheigre As of now this is in ready for development, which means that it will soon be re-assigned to an engineer to complete the remaining development tasks. Given the current team capacity and the active work on the Runner Kanban board (i.e. issues in development, ready for review, in review), I forecast that we will be able to get an engineer assigned to this during the 13.6 milestone. I will review this assumption with the team in our next team meeting this Monday, October 12, and provide an update here as needed.
is supported only for Unix systems, where SIGQUIT signal can be sent to the process.
Can you clarify why we only support SIGQUIT here and not also SIGTERM? os/signal already maps some Windows concepts onto SIGTERM for graceful handling of processes.
@ajwalker I'm currently writing down all the problems that I've encountered so far while trying to solve this issue. I'll let you know when it's done, and I propose that we then sync on a call.