gitlab-runner-helper on docker Windows fails internal certificate revocation list checks within a proxied environment
Context first: this is somewhat similar to #2434 & #28135, but not the same, because git clones do work in our case:
- We are running, as part of our self-hosted gitlab infra, a farm of Windows Runners (currently 16.0.2 and not 16.1.0 because of #35848 (closed)), in both 1809 and 21H2 flavors
- Our environment is proxied, so the runners do not have direct internet access, but can access it via the proxy and appropriate environment settings
- Our jobs artifacts & caches are stored in S3 buckets
- The main gitlab endpoint & s3 endpoints are NOT proxied, they can be accessed directly by the runners
- On any TLS interaction, we see in the Windows logs that Windows tries to contact the CRL endpoint of the CA of the TLS cert of our gitlab endpoint and the S3 endpoints, and this seems to be the root of all the issues. We haven't found a way to disable this centrally for the whole OS.
The problem shows up when executing a job that needs to, for example, download artifacts: the download fails. We can reproduce the error directly by running the runner helper image:
> docker run -ti registry.gitlab.com/gitlab-org/gitlab-runner/gitlab-runner-helper:x86_64-v16.0.2-servercore1809 powershell
C:\>gitlab-runner-helper artifacts-downloader --url "https://code.siemens.com" --token "xxx" --id "xxx"
ERROR: Downloading artifacts from coordinator... error couldn't execute GET against https://code.siemens.com/api/v4/jobs/xxx/artifacts?: Get "https://code.siemens.com/api/v4/jobs/xxx/artifacts?": net/http: TLS handshake timeout id=xxx token=xxx
WARNING: Retrying... error=invalid argument
ERROR: Downloading artifacts from coordinator... error couldn't execute GET against https://code.siemens.com/api/v4/jobs/xxx/artifacts?: Get "https://code.siemens.com/api/v4/jobs/xxx/artifacts?": net/http: TLS handshake timeout id=xxx token=xxx
WARNING: Retrying... error=invalid argument
ERROR: Downloading artifacts from coordinator... error couldn't execute GET against https://code.siemens.com/api/v4/jobs/xxx/artifacts?: Get "https://code.siemens.com/api/v4/jobs/xxx/artifacts?": net/http: TLS handshake timeout id=xxx token=xxx
FATAL: invalid argument
Checking the system logs, we can see that this is caused by the CRL functionality:
* schannel: next InitializeSecurityContext failed: Unknown error (0x80092013) - The revocation function was unable to check revocation because the revocation server was offline.
* Closing connection 0
* schannel: shutting down SSL/TLS connection with code.siemens.com port 443
curl: (35) schannel: next InitializeSecurityContext failed: Unknown error (0x80092013) - The revocation function was unable to check revocation because the revocation server was offline.
When injecting proxy settings into the runner helper, e.g. via pre_get_sources_script at the runner config level, AND triggering an Invoke-WebRequest, this internal initialization works properly and the runner helper can continue:
> docker run -ti registry.gitlab.com/gitlab-org/gitlab-runner/gitlab-runner-helper:x86_64-v16.0.2-servercore1809 powershell
PS C:\> $proxyServer = "someproxy-url:1234"
PS C:\> $proxyBypass = @("code.siemens.com", "s3.eu-central-1.amazonaws.com", "s3.dualstack.eu-central-1.amazonaws.com")
PS C:\> $WebProxy = New-Object System.Net.WebProxy($proxyServer, $true, $proxyBypass)
PS C:\> [System.Net.WebRequest]::DefaultWebProxy = $WebProxy
PS C:\> Invoke-WebRequest https://code.siemens.com -UseBasicParsing
StatusCode : 200
...
PS C:\> gitlab-runner-helper artifacts-downloader --url 'https://code.siemens.com' --token 'xxx' --id 'xxx'
ERROR: Downloading artifacts from coordinator... unauthorized host=code.siemens.com id=xxx responseStatus=401 Unauthorized status=401 Unauthorized token=xxx
FATAL: permission denied
We can see the same effect on both the gitlab endpoint & the S3 endpoints. We have tracked the issue to the interaction between Windows' internal TLS handling in schannel (which automatically triggers a validation step of the CRL of the CA) and gitlab-runner-helper, probably at the golang level. We suspect that the custom handling done at https://gitlab.com/gitlab-org/gitlab-runner/-/tree/main/helpers/tls/ca_chain could be part of the issue.
The workaround above (injecting the proxy into the pre get sources script and invoking a GET via powershell) seems to work for jobs that trigger a git clone, but fails for anything that does not perform a clone (e.g. with GIT_STRATEGY: none), because in that case the script is not executed.
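
For reference, the workaround described above could look roughly like the following in the runner's config.toml. This is a sketch, not our exact configuration: the runner name is a hypothetical placeholder, the proxy address and bypass list are the same placeholder values used in the reproduction above, and the `| Out-Null` is only added here to keep the job log quiet.

```toml
# Sketch only: the name is hypothetical; proxy and bypass values are placeholders.
[[runners]]
  name = "windows-docker-runner"   # hypothetical runner name
  executor = "docker-windows"
  # PowerShell run before sources are fetched. It sets a default .NET web proxy
  # with a bypass list, then issues one GET against the gitlab endpoint.
  pre_get_sources_script = '''
$proxyServer = "someproxy-url:1234"
$proxyBypass = @("code.siemens.com", "s3.eu-central-1.amazonaws.com", "s3.dualstack.eu-central-1.amazonaws.com")
$WebProxy = New-Object System.Net.WebProxy($proxyServer, $true, $proxyBypass)
[System.Net.WebRequest]::DefaultWebProxy = $WebProxy
Invoke-WebRequest https://code.siemens.com -UseBasicParsing | Out-Null
'''
```

As noted, this only helps when the script actually runs, so it does not cover jobs that skip fetching sources.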