Ensure subprocess termination if GitLab Runner exits on Windows

What does this MR do?

  • Wrap the GitLab Runner process in a Windows Job, single handle in the Runner process
  • If the runner process terminates, the Job handle closes, and terminates all subprocesses / shell executions
  • This ensures that under any error condition of the runner process - we have proper cleanup

Why was this MR needed?

  • On unexpected exits of the runner - it would leak running processes / shell executions
  • The runner would be automatically restarted by it's service
  • New GitLab job executions would allocate the colliding job scratch folders
  • File handles from the leaked processes would disallow opening files there, git cloning, etc - making new jobs fail
    • we have this happen every couple of days in our GitLab deployment + Runners fleet, requiring cleaning up the machine manually (due to very long-running GitLab jobs)

What's the best way to test this MR?

  • This MR contains a system test that checks:
    • Calls the EnsureSubprocessTerminationOnExit method in a test process
    • Spawn a subprocess
    • Terminate the test process
    • Check the child is terminated as well

Leading CR questions

  • I've added this encapsulation to the Run and RunSingle commands.
    • Question: Maybe just put it in main once instead? What do you think?
  • Private -> Exported methods in job_windows.go - I think it's OK they are exported, as this file is supposed to expose process/job functionality.
  • Should this be gated by the FF_USE_WINDOWS_JOB_OBJECT feature flag?
    • I don't think so, I don't see a legitimate use case where one would explicitly want to leak subprocesses when a runner dies.
  • Should we drop the version check (Win8+) - as the documentation explicitly says the GitLab Runner is supported only on Microsoft-supported Windows versions?
    • Then it may fail with less indicative errors on Windows 7 when used together with FF_USE_WINDOWS_JOB_OBJECT
    • But we get a little cleaner/shorter code
    • Or maybe even push the version check globally into a more central location (warn)

Merge request reports

Loading