Task kill after timeout uses the wrong PID if the parent task has disappeared on Windows.

Summary

The GitLab Runner (Windows) does not respect the specified timeout when the job's process starts child processes. In some cases, the job runs indefinitely and is never canceled.

Technical details:

After the specified timeout elapses, I can see the runner running "taskkill /F /T /PID <PID>" at a regular interval, with a PID that does not correspond to any active process on the machine.

In the attached screenshot, it tries to kill process 9072, which does not exist. It should actually kill "ssh-agent.exe" (11268), because that process was started by a process that the job started. As soon as I kill it manually, the job is reported as failed with a message indicating that it failed because of a timeout.

Steps to reproduce

The job starts a process “p1.exe”, which starts the child process “ssh-agent.exe”. “p1.exe” finishes its work and terminates, but “ssh-agent.exe” keeps running. In this case the job appears to be stuck. After the timeout, I think GitLab is trying to kill “p1.exe” instead of “ssh-agent.exe” (but I cannot confirm this because by then it is too late to check the PID of the “p1.exe” process).
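For anyone without the original “p1.exe” at hand, the following is a minimal Python sketch of a stand-in that could be used to try to reproduce the scenario. It is an assumption, not the actual test case: the function name `spawn_lingering_child` is hypothetical, a sleeping Python interpreter plays the role of “ssh-agent.exe”, and I have not confirmed that it triggers the bug.

```python
import subprocess
import sys


def spawn_lingering_child(seconds: int = 600) -> int:
    """Start a child process that keeps running after the caller exits.

    The child is a plain sleeping interpreter (a stand-in for ssh-agent.exe).
    Returns the child's PID so it can be compared with the PID that the
    runner later passes to taskkill.
    """
    child = subprocess.Popen(
        [sys.executable, "-c", f"import time; time.sleep({seconds})"],
        # Detach the standard streams so the child does not hold the
        # job's console pipes open.
        stdin=subprocess.DEVNULL,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    # Deliberately no child.wait(): the caller is expected to exit
    # while the child is still alive.
    return child.pid
```

A job script would call `spawn_lingering_child()` as its last step, print the returned PID, and exit, with the job timeout set shorter than the sleep.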

Just to be clear, it is the runner that starts "taskkill", not the job. So it looks like an issue in the runner, which is not able to kill the right process.

Actual behavior

taskkill uses the old PID of the parent task that started the job.

Expected behavior

taskkill uses the PID of the process that is currently running, even if it is a child process.
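To illustrate what "the right PID" could mean here, below is a rough Python sketch, not the runner's actual logic. It assumes the runner can obtain a snapshot of (PID, parent PID) pairs for live processes; on Windows a process keeps recording its creator's PID after the creator has exited. The function name `find_descendants` and the sample parent link between 9072 and 11268 are assumptions for illustration only.

```python
from collections import defaultdict, deque
from typing import Iterable, List, Tuple


def find_descendants(root_pid: int,
                     snapshot: Iterable[Tuple[int, int]]) -> List[int]:
    """Return live PIDs descended from root_pid, even if root_pid is gone.

    snapshot holds (pid, parent_pid) pairs for processes that are
    currently alive; root_pid itself does not need to appear in it.
    """
    # Index the snapshot by parent PID.
    children = defaultdict(list)
    for pid, ppid in snapshot:
        children[ppid].append(pid)

    # Breadth-first walk starting from the (possibly dead) root.
    found: List[int] = []
    seen = {root_pid}
    queue = deque([root_pid])
    while queue:
        current = queue.popleft()
        for child in children.get(current, []):
            if child not in seen:
                seen.add(child)
                found.append(child)
                queue.append(child)
    return found


# Assumed sample data: 9072 has exited, 11268 still names it as parent.
print(find_descendants(9072, [(11268, 9072), (500, 4)]))
```

Two caveats apply to this sketch. PIDs can be reused, so a stale parent PID may match an unrelated process. The walk also only bridges a gap at the root: if an intermediate process such as “p1.exe” has exited, the chain from the recorded PID to “ssh-agent.exe” is broken. Grouping the job's processes in a Windows job object might avoid both problems, but I have not verified how the runner tracks processes.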

Relevant logs and/or screenshots

Ticket reference with test case (internal only): https://gitlab.zendesk.com/agent/tickets/107445

Environment description

concurrent = 1
check_interval = 0

[[runners]]
  name = "****"
  url = "https://gitlab****/ci"
  token = "*****"
  executor = "shell"
  [runners.cache]

Used GitLab Runner version

Runner Version: 11.2.0 (35e8515d) on Windows