gitlab-runner regularly calls taskkill with a free/stale PID, randomly killing build processes
Summary
Since yesterday (or maybe the day before) we see the Windows build jobs on our private runners randomly dying. Researching the problem with Sysinternals Process Monitor (procmon) on the runners, I found that gitlab-runner.exe regularly calls taskkill with a free/stale/outdated PID, which could at any time be reassigned to any build process. When a build process happens to get this PID, it is killed, which terminates the build.
Steps to reproduce
Run a Windows build job which creates several tens of thousands of processes (our builds typically run for 4 to 6 hours). In the log you will see that some processes randomly die without an error message.
Start procmon or Process Explorer on a private Windows runner and look at what gitlab-runner does. Convince yourself that a process with the given PID does not exist, so that the PID is free for reassignment.
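To check whether a PID seen in procmon is currently free, something like the following might help. It is only a sketch: `pid_owner` and `parse_tasklist_csv` are hypothetical helper names, and it assumes the standard Windows `tasklist` command with its `/FI`, `/FO CSV` and `/NH` options is available on the runner.

```python
import csv
import io
import subprocess


def parse_tasklist_csv(output, pid):
    """Return the image name owning `pid` in `tasklist /FO CSV /NH` output, or None."""
    for row in csv.reader(io.StringIO(output)):
        # A matching row has the image name first and the PID second.
        # The "INFO: No tasks are running ..." message parses as a
        # single-field row and is skipped by the length check.
        if len(row) >= 2 and row[1].strip() == str(pid):
            return row[0]
    return None


def pid_owner(pid):
    """Ask Windows which process, if any, currently owns `pid` (Windows only)."""
    result = subprocess.run(
        ["tasklist", "/FI", f"PID eq {pid}", "/FO", "CSV", "/NH"],
        capture_output=True, text=True, check=False,
    )
    return parse_tasklist_csv(result.stdout, pid)
```

If `pid_owner` returns `None` for the PID that gitlab-runner passes to taskkill, the PID is free at that moment. The check is inherently racy, since the PID can be reassigned right after it runs, which is exactly the window in which a build process gets hit.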
Example Project
Look at any failed job in https://gitlab.com/coq/coq/-/jobs with tag windows-inria.
What is the current bug behavior?
Build processes are randomly killed by gitlab-runner when the stale PID it passes to taskkill happens to have been reassigned to them.
What is the expected correct behavior?
jobs are not randomly killed
Relevant logs and/or screenshots
Look at any failed job in https://gitlab.com/coq/coq/-/jobs with tag windows-inria.
A process with the PID passed to taskkill usually does not exist, but during a build the PID may be assigned to a process for a short time, and that process is then killed.
Output of checks
Results of GitLab environment info
Not sure how to do this on Windows runners
Results of GitLab application Check
Not sure how to do this on Windows runners
Possible fixes
Make sure that the PID given to taskkill is for a process currently owned by gitlab-runner.
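A rough illustration of that idea, not gitlab-runner's actual code (which is written in Go): keep the process handle for everything the runner starts, and only kill through a handle that is still known to be alive. `TrackedProcesses` and its methods are hypothetical names made up for this sketch.

```python
import subprocess


class TrackedProcesses:
    """Hypothetical registry of child processes started by the runner."""

    def __init__(self):
        self._children = {}  # pid -> subprocess.Popen handle

    def spawn(self, args):
        proc = subprocess.Popen(args)
        self._children[proc.pid] = proc
        return proc

    def kill(self, pid):
        """Kill `pid` only if it is one of our children and still running."""
        proc = self._children.get(pid)
        if proc is None:
            return False  # never started by us: do not touch it
        if proc.poll() is not None:
            # Our child already exited, so the PID may belong to
            # someone else by now. Forget it instead of killing.
            del self._children[pid]
            return False
        proc.kill()
        proc.wait()
        del self._children[pid]
        return True
```

The point of the design is that a bare PID is never handed to an external kill command. A kill request for a PID the runner no longer tracks becomes a no-op instead of hitting whichever process owns that PID now.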