gitlab-runner regularly calls taskkill with a free/stale PID, randomly killing build processes

Summary

Since yesterday (or maybe the day before) we see the Windows build jobs on our private runners dying at random. Investigating the problem with Sysinternals Procmon on the runners, I found that gitlab-runner.exe regularly calls taskkill with a free/stale/outdated PID, which can be reassigned to any build process at any time. When a build process happens to receive this PID, it gets killed, which fails the build.

Steps to reproduce

Run a Windows build job that creates several tens of thousands of processes (our builds typically run for 4 to 6 hours). In the log you will see that some processes randomly die without an error message.

Start Procmon or Process Explorer on a private Windows runner and look at what gitlab-runner does. Convince yourself that a process with the given PID does not exist, so that the PID is free for reassignment.
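
If no long-running build is at hand, the first step can be approximated with a small stress script. The following is only a sketch under assumptions: the function name `spawn_many` is made up for illustration, the children are trivial Python interpreters rather than real build tools, and it is not confirmed that such short-lived processes are hit as often as the processes of a real build.

```python
import subprocess
import sys


def spawn_many(count):
    """Start `count` short-lived child processes one after another.

    Returns a list of (index, returncode) pairs for every child that did
    not exit with code 0, i.e. children that may have been killed from
    outside.
    """
    failures = []
    for index in range(count):
        # Each child is a fresh interpreter that does nothing and exits,
        # so every iteration consumes a new PID.
        result = subprocess.run([sys.executable, "-c", "pass"])
        if result.returncode != 0:
            failures.append((index, result.returncode))
    return failures
```

Calling `spawn_many(50000)` from a job script on the affected runner and printing the returned list would show whether any child ended with an unexpected exit code while Procmon records the taskkill calls.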

Example Project

Look at any failed job in https://gitlab.com/coq/coq/-/jobs with tag windows-inria.

What is the current bug behavior?

Processes are randomly killed by gitlab-runner.

What is the expected correct behavior?

Jobs are not randomly killed.

Relevant logs and/or screenshots

Look at any failed job in https://gitlab.com/coq/coq/-/jobs with tag windows-inria.

Bug

A process with the PID passed to taskkill usually does not exist, but during a build that PID may be assigned to a process for a short time, and that process then gets killed.

Output of checks

Results of GitLab environment info

Not sure how to do this on Windows runners.

Results of GitLab application Check

Not sure how to do this on Windows runners.

Possible fixes

Make sure that the PID given to taskkill is for a process currently owned by gitlab-runner.
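
One way such a check could look is sketched below. This is not gitlab-runner's actual code: the class `OwnedProcesses`, its methods, and the `lookup_start_time` hook are hypothetical names used only for illustration. The idea is to remember each started process by PID plus creation time and to refuse the kill when the PID is no longer live or now belongs to a process with a different creation time. On Windows the creation time could be read with the Win32 `GetProcessTimes` function; here the hook is replaced by a plain dictionary so the logic can be run anywhere.

```python
from typing import Callable, Dict, Optional


class OwnedProcesses:
    """Tracks processes the runner started, identified by PID and creation time."""

    def __init__(self, lookup_start_time: Callable[[int], Optional[float]]):
        # Hypothetical hook: returns the creation time of the live process
        # with the given PID, or None if no such process exists.
        self._lookup = lookup_start_time
        self._owned: Dict[int, float] = {}

    def register(self, pid: int, start_time: float) -> None:
        """Record a process right after the runner has started it."""
        self._owned[pid] = start_time

    def safe_to_kill(self, pid: int) -> bool:
        """Return True only if `pid` still refers to the process we started."""
        recorded = self._owned.get(pid)
        if recorded is None:
            return False  # never started by us
        live = self._lookup(pid)
        if live is None or live != recorded:
            # Our process has exited; the PID is free or has been reused.
            del self._owned[pid]
            return False
        return True


# Demonstration with a fake process table (PID -> creation time).
process_table = {4242: 100.0}
owned = OwnedProcesses(process_table.get)
owned.register(4242, 100.0)

first_check = owned.safe_to_kill(4242)   # True: original process still alive
process_table[4242] = 250.0              # PID reused by an unrelated process
second_check = owned.safe_to_kill(4242)  # False: creation time differs
```

A small window remains between the check and the actual kill, so this narrows the problem rather than eliminating it.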