Update the new Kubernetes execution strategy to work with Windows pods (!3048) · Merge requests · GitLab.org / gitlab-runner

Arran Walker requested to merge 28081-windows-pods-cannot-be-scheduled-with-the-kubernetes-newer-execution-strategy into main Jul 22, 2021

What does this MR do?

Updates the new execution strategy used by Kubernetes to support Windows pods.

Why was this MR needed?

The strategy has a reliance on Linux tools.

I've dropped the permission init container in favour of changing the permissions of the log file within the trap script. This gives us greater flexibility over the commands run in the context of the shell being used (for example, I needed a script specific to PowerShell, whether it runs on a Linux or a Windows pod), and puts scripting back within the shells package, rather than having the kubernetes package deal with scripts directly. The permission "fix" script was the only thing making use of init containers, so that functionality and its associated tests have now been removed.
The recent PowerShell trap shell support works for Linux, but not Windows, as it requires an open log file that is written to from two processes, which Windows does not support.

The command executed is roughly: parse_pwsh_script.ps1 /path/to/stage_script > log/output.log, with the PowerShell Trap logic also writing to log/output.log. This doesn't work on Windows.

Instead of using Add-Content to append the JSON exit code information to the log file directly, I've opted to just echo it to stdout, which gets tee'd to log/output.log via the redirect, so log processing still functions as expected.
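
The single-writer idea above can be sketched roughly as follows. This is an illustration only, not the runner's implementation: the child script, the run_stage helper and the command_exit_code field name are all hypothetical stand-ins. The point is that the stage process reports its own exit code on stdout, so the redirect is the only thing that ever writes to the log file.

```python
import json
import subprocess
import sys
import tempfile
from pathlib import Path

# Hypothetical stand-in for a stage script wrapped in trap logic: it prints
# its normal output, then reports its own exit code as a JSON line on stdout.
# The "command_exit_code" field name is an assumption made for this sketch.
CHILD = r"""
import json, sys
code = int(sys.argv[1])
print("building...")
print(json.dumps({"command_exit_code": code}))
sys.exit(code)
"""


def run_stage(exit_code: int, log_path: Path) -> int:
    """Run the stand-in stage script with stdout redirected to log_path.

    Only the child's redirected stdout writes to the log, so no second
    process ever needs to open the same file.
    """
    with open(log_path, "w", encoding="utf-8") as log:
        proc = subprocess.run(
            [sys.executable, "-c", CHILD, str(exit_code)],
            stdout=log,
        )
    return proc.returncode


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        log_file = Path(tmp) / "output.log"
        run_stage(3, log_file)
        # The last line of the log is the JSON record written by the child.
        last_line = log_file.read_text(encoding="utf-8").splitlines()[-1]
        print(json.loads(last_line))
```

Because the exit-code record travels through the same stream as the job output, the approach does not depend on whether the platform allows two processes to append to one open file.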

The largest change here is that stdout will now also contain the JSON exit code, and running kubectl logs <container> directly will show it. However, I'm not sure this is a problem, and it might even be helpful for debugging.
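
For anyone reading raw container output while debugging, a minimal sketch of separating the exit-code record from ordinary job output is shown below. The split_exit_record helper and the command_exit_code key are assumptions made for illustration; the actual record written by the trap script may be shaped differently.

```python
import json
from typing import Iterable, List, Optional, Tuple

# Assumed field name for the exit-code record; the real key may differ.
EXIT_CODE_KEY = "command_exit_code"


def split_exit_record(lines: Iterable[str]) -> Tuple[List[str], Optional[int]]:
    """Separate ordinary output lines from the JSON exit-code record.

    Returns the output lines with the record removed, plus the exit code
    (or None if no record was found).
    """
    output: List[str] = []
    exit_code: Optional[int] = None
    for line in lines:
        stripped = line.strip()
        if stripped.startswith("{"):
            try:
                record = json.loads(stripped)
            except json.JSONDecodeError:
                record = None
            # Only treat the line as the record if it carries the expected key.
            if isinstance(record, dict) and isinstance(record.get(EXIT_CODE_KEY), int):
                exit_code = record[EXIT_CODE_KEY]
                continue
        output.append(line)
    return output, exit_code


if __name__ == "__main__":
    sample = ["building...", '{"unrelated": "json"}', '{"command_exit_code": 1}']
    print(split_exit_record(sample))
```

Lines that merely look like JSON but lack the expected key are passed through unchanged, so user output that happens to contain JSON is not swallowed.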

What's the best way to test this MR?

The trap exit code has to function when using the new execution strategy or else the logic just hangs, so testing any simple job for both Linux and Windows using the Kubernetes executor is likely enough for a manual QA.

For Windows specifically:

Set up a Windows k8s cluster on GCP.
Authenticate locally with gcloud container clusters get-credentials <cluster name>

Configure the runner's config.toml:

[[runners]]
...
executor = "kubernetes"
shell = "pwsh"

[runners.feature_flags]
  # If deploying the Runner to Linux, but targeting Windows, this FF needs to be enabled for now
  FF_USE_POWERSHELL_PATH_RESOLVER = true

  # We explicitly want to ensure the new strategy is used, but that's the default anyway now
  # FF_USE_LEGACY_KUBERNETES_EXECUTION_STRATEGY = false
[runners.kubernetes]
  image = "mcr.microsoft.com/powershell:lts-nanoserver-1809"
  [runners.kubernetes.node_selector]
    "kubernetes.io/arch" = "amd64"
    "kubernetes.io/os" = "windows"
    "node.kubernetes.io/windows-build" = "10.0.17763"

Run a simple job.

For Linux/OpenShift etc:

The tests and manual QA steps defined in !2749 (merged) should be run through, as these were the steps introduced to test the permission container step. Whilst this MR looks like a larger change, the commands that set the permissions are identical, run within the same environment, and still run before any user build step.

What are the relevant issue numbers?

Closes #28081 (closed)

Edited Jul 26, 2021 by Arran Walker
