Task kill after timeout uses the wrong PID if the parent task has disappeared on Windows.
The GitLab runner (Windows) does not respect the specified timeout when the job's process starts child processes. In some cases, the job runs indefinitely and is never canceled.
After the specified timeout, I can see the runner running "taskkill /F /T /PID <PID>" at regular intervals, with a PID that does not correspond to any active process on the machine.
In the attached screenshot, it tries to kill process 9072, which does not exist. It should actually kill "ssh-agent.exe" (11268), because that process was started by a process that the job started. As soon as I kill it manually, the job is reported as failed with a message indicating that it failed because of a timeout.
Steps to reproduce
The job starts a process “p1.exe”, which starts the child process “ssh-agent.exe”. “p1.exe” finishes its work and terminates, but “ssh-agent.exe” keeps running. In this case the job appears to be stuck. After the timeout, I think GitLab is trying to kill “p1.exe” instead of “ssh-agent.exe” (but I cannot confirm this, because by then it is too late to check the PID of the “p1.exe” process).
Just to be clear, it is the runner that starts "taskkill", not the job. So it looks like an issue in the runner, which is not able to kill the right process.
Actual behavior
taskkill uses the old PID of the parent task that started the job, even though that process no longer exists.
Expected behavior
taskkill uses the PID of a process that is still running, even if that process is a child.
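To make the expected behavior concrete, here is a minimal sketch of one way a supervisor could remember a job's descendants so that a surviving child can still be targeted after its parent has exited. It is not how GitLab Runner is implemented, only an illustration. The helper names `track` and `kill_targets` are hypothetical, the process snapshots are hand-written `(pid, parent_pid)` pairs rather than real OS queries, PID 9500 for “p1.exe” and PID 4000 are made up, and treating 9072 as the job's root process is an assumption.

```python
from typing import Iterable, Set, Tuple

Row = Tuple[int, int]  # (pid, parent_pid)


def track(known: Set[int], snapshot: Iterable[Row]) -> Set[int]:
    """Return `known` plus every process in `snapshot` descended from a known PID."""
    known = set(known)
    rows = list(snapshot)
    changed = True
    while changed:  # repeat until no new descendant is found
        changed = False
        for pid, ppid in rows:
            if ppid in known and pid not in known:
                known.add(pid)
                changed = True
    return known


def kill_targets(known: Set[int], snapshot: Iterable[Row]) -> Set[int]:
    """Tracked PIDs that are still alive in the latest snapshot."""
    alive = {pid for pid, _ in snapshot}
    return known & alive


# Hypothetical timeline: 9072 = job root (assumed), 9500 = p1.exe (made up),
# 11268 = ssh-agent.exe (PID taken from the report).
known = {9072}
known = track(known, [(9072, 4000), (9500, 9072)])                  # p1.exe started
known = track(known, [(9072, 4000), (9500, 9072), (11268, 9500)])   # ssh-agent.exe started
at_timeout = [(11268, 9500)]                                        # only ssh-agent.exe is left
targets = kill_targets(known, at_timeout)
print(sorted(targets))
```

With only the root PID remembered, `kill_targets({9072}, at_timeout)` is empty, which mirrors the behavior described in this report. A real implementation would also have to deal with PID reuse, which this sketch ignores.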
Relevant logs and/or screenshots
Ticket reference with test case (internal only): https://gitlab.zendesk.com/agent/tickets/107445
concurrent = 1
check_interval = 0

[[runners]]
  name = "****"
  url = "https://gitlab****/ci"
  token = "*****"
  executor = "shell"
  [runners.cache]
Used GitLab Runner version
Runner Version: 11.2.0 (35e8515d) on Windows