Unset CI_SERVER_TLS_CA_FILE before registering the runner

Tomasz Maczukin requested to merge fix-tls-certificates-problem into main

We allow setting the CA file explicitly for the Runner process. When it is set, Runner skips parsing the server response entirely and uses whatever was configured.

It can be done by placing the CA chain file at the "well-known path", by pointing to the file in the config file, or... by using an environment variable during registration, which populates the corresponding field in the configuration file.

This causes problems in the runner-incept project, because our main jobs - the ones that start temporary incept runners - are executed on our SaaS environment, where we use internal networking for API and Git communication.

So the job in which we register the temporary runner has the CI_SERVER_TLS_CA_FILE variable set to a file that contains the CA chain of the internal load balancer. This variable is recognized by the registration command, and its value is placed into the config.toml file of the temporary runner. At the same time, we configure the runner to communicate directly with https://gitlab.com, which in itself is fine.

But when we next start the incept scenario job, the runner is configured to "trust" the internal load balancer's CA chain.

The API call to GitLab.com still succeeds, because whatever was set in tls-ca-file in config.toml is added to the certificates from the system store. So the GitLab API client in Runner is able to communicate.

The job is started, and during its execution Runner configures Git to use CI_SERVER_TLS_CA_FILE as the source of truth for certificate verification.

Unfortunately, libcurl, which Git uses for Git-over-HTTP requests, doesn't merge the configured certificates with the system store. If a CA file is specified, it's the only source of truth.

This means that our job, which tries to clone sources from https://gitlab.com, fails, because Git is unable to verify the TLS connection using the internal load balancer's CA chain.
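
When debugging a failure like this, it can help to check which CA bundle Git has been told to use for a given URL. The sketch below is only an illustration: the helper name `git_ca_file_for_url` is hypothetical, and it looks at the two standard Git settings (`GIT_SSL_CAINFO` and `http.sslCAInfo`), without assuming which of them Runner sets in a particular version.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: print the CA bundle Git is configured to use for
# the given URL, or "system default" when no override is present.
git_ca_file_for_url() {
  local url="$1"

  # The GIT_SSL_CAINFO environment variable overrides http.sslCAInfo.
  if [ -n "${GIT_SSL_CAINFO:-}" ]; then
    echo "$GIT_SSL_CAINFO"
    return 0
  fi

  # --get-urlmatch also picks up URL-scoped entries such as
  # http.https://gitlab.com.sslCAInfo; it exits non-zero when nothing matches.
  local from_config
  from_config="$(git config --get-urlmatch http.sslCAInfo "$url" || true)"
  echo "${from_config:-system default}"
}
```

Running `git_ca_file_for_url https://gitlab.com` inside a failing job would show whether Git is pointed at a single CA file; if that file holds only the internal load balancer's chain, verification against https://gitlab.com cannot succeed.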

Because the jobs that start temporary incept runners were running on randomly chosen SaaS runner shards, some of which use internal networking and some of which don't (yet!), we had random failures in the incept scenario jobs. Once we migrate the last shard to internal networking (which will happen soon!), all jobs would start failing consistently.

By unsetting CI_SERVER_TLS_CA_FILE before calling the gitlab-runner register command, we make sure that the "wrong" CA chain is not added to the config.toml file and that the runner works in the mode where it discovers the CA chain from the job request API response.
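
A minimal sketch of the idea, not the exact change in this merge request: the wrapper name `run_without_ca_file` is hypothetical, and it runs any command in a subshell with the variable removed, so the rest of the job keeps its original environment.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical wrapper: run the given command with CI_SERVER_TLS_CA_FILE
# removed from its environment. The subshell keeps the unset local, so the
# calling job still sees the variable afterwards.
run_without_ca_file() {
  (
    unset CI_SERVER_TLS_CA_FILE
    "$@"
  )
}

# Example usage (the registration arguments are elided on purpose):
#   run_without_ca_file gitlab-runner register --non-interactive ...
```

Scoping the unset to the registration command is a design choice for this sketch; a plain `unset CI_SERVER_TLS_CA_FILE` earlier in the script has the same effect on registration but also removes the variable for every later command in the job.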
