Limit the duration of TLS keepalive connection reuse
Background
Recently GitLab.com rotated its SSL cert from one signed by DigiCert to one signed by LetsEncrypt. However, as described in gitlab-com/gl-infra/production#17265 (closed), this caused a number of issues. Many users had to restart their runners to pick up the new certs because:
1. GitLab Runner establishes a TLS keep-alive connection with GitLab.com via Cloudflare. Before the cert change, this was using the DigiCert root.
2. When no jobs are available, Workhorse holds the request in a long poll.
3. When the long poll timeout finishes (50 s, set by `apiCiLongPollingDuration`), the Runner retries with another request, reusing the TLS connection.
4. This repeats until a job is available for the runner.
5. When a job becomes available, the Runner extracts the TLS certs from this keep-alive connection and uses them to build `CI_SERVER_TLS_CA_FILE`. Unfortunately, since this connection was established at step 1, the certs are old.
6. The Runner helper attempts to run `git clone` with the certs from step 5, but these don't match the new LetsEncrypt cert.
7. The job fails, and the Runner goes back to step 2 with the existing TLS connection.
When a Runner connects to GitLab.com, the TLS connection is between Cloudflare and the Runner. Any changes in the certs don't shut down existing TLS connections.
Restarting the Runner helps because it restarts that TLS keep-alive connection, which will receive the new cert.
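To illustrate why the certs go stale, here is a minimal sketch in Go. It is not the Runner's actual code: the `poll` and `twoPolls` helpers, the `pollResult` type, and the local `httptest` server (which always answers 204 as a stand-in for "no job available") are assumptions made for the example. It shows that the TLS state attached to a response describes the handshake of the underlying connection, so a reused connection keeps reporting the certificates it saw when it was first established.

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"net/http"
	"net/http/httptest"
	"net/http/httptrace"
)

// pollResult records what one poll observed about its connection.
type pollResult struct {
	Reused   bool   // true if the request ran on an already-open connection
	LeafHash string // SHA-256 of the leaf certificate from the TLS handshake
}

// poll sends one request and inspects the TLS state attached to the response.
func poll(client *http.Client, url string) (pollResult, error) {
	var res pollResult
	trace := &httptrace.ClientTrace{
		GotConn: func(info httptrace.GotConnInfo) { res.Reused = info.Reused },
	}
	req, err := http.NewRequest(http.MethodPost, url, nil)
	if err != nil {
		return res, err
	}
	req = req.WithContext(httptrace.WithClientTrace(req.Context(), trace))
	resp, err := client.Do(req)
	if err != nil {
		return res, err
	}
	defer resp.Body.Close()
	if resp.TLS == nil || len(resp.TLS.PeerCertificates) == 0 {
		return res, fmt.Errorf("no TLS peer certificates on response")
	}
	// resp.TLS comes from the connection's handshake, not from this request,
	// so a reused connection reports the certificates it was opened with.
	res.LeafHash = fmt.Sprintf("%x", sha256.Sum256(resp.TLS.PeerCertificates[0].Raw))
	return res, nil
}

// twoPolls runs two polls against a local stand-in server.
func twoPolls() (first, second pollResult, err error) {
	srv := httptest.NewTLSServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusNoContent) // stand-in for "no job available"
	}))
	defer srv.Close()
	client := srv.Client()
	if first, err = poll(client, srv.URL); err != nil {
		return
	}
	second, err = poll(client, srv.URL)
	return
}

func main() {
	first, second, err := twoPolls()
	if err != nil {
		panic(err)
	}
	fmt.Println("second poll reused connection:", second.Reused)
	fmt.Println("same leaf certificate:", first.LeafHash == second.LeafHash)
}
```

Because the second poll runs on the connection opened by the first, a certificate rotated on the server side in between would not be visible to the client until a new connection is made.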
Proposal
Currently the Runner keeps the default `DisableKeepAlives` setting of `false`, which means that requests to `POST /api/v4/jobs/request` reuse the same connection before and after jobs run. In my local test with Wireshark, I saw that this connection appears to live indefinitely.
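The effect of this setting can be checked without Wireshark. The sketch below is only an illustration under assumptions: the `secondRequestReused` helper is hypothetical, and the local `httptest` server stands in for GitLab.com. It uses `net/http/httptrace` to report whether a second request ran on the connection opened by the first, for each value of `DisableKeepAlives` on `http.Transport`.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"net/http/httptrace"
)

// secondRequestReused sends two requests through a transport configured with
// the given DisableKeepAlives value and reports whether the second request
// ran on the connection opened by the first.
func secondRequestReused(disableKeepAlives bool) (bool, error) {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusNoContent) // stand-in response, no body
	}))
	defer srv.Close()

	transport := &http.Transport{DisableKeepAlives: disableKeepAlives}
	defer transport.CloseIdleConnections()
	client := &http.Client{Transport: transport}

	var reused bool
	trace := &httptrace.ClientTrace{
		// GotConn fires once per request; the last call reflects request two.
		GotConn: func(info httptrace.GotConnInfo) { reused = info.Reused },
	}
	for i := 0; i < 2; i++ {
		req, err := http.NewRequest(http.MethodPost, srv.URL+"/api/v4/jobs/request", nil)
		if err != nil {
			return false, err
		}
		req = req.WithContext(httptrace.WithClientTrace(req.Context(), trace))
		resp, err := client.Do(req)
		if err != nil {
			return false, err
		}
		resp.Body.Close()
	}
	return reused, nil
}

func main() {
	for _, disable := range []bool{false, true} {
		reused, err := secondRequestReused(disable)
		if err != nil {
			panic(err)
		}
		fmt.Printf("DisableKeepAlives=%v: second request reused connection: %v\n", disable, reused)
	}
}
```

Setting `DisableKeepAlives` to `true` would avoid stale connections entirely, but at the cost of a new TCP and TLS handshake for every request, which is why a lifetime limit is the more attractive option.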
In https://developers.cloudflare.com/load-balancing/load-balancers/, Cloudflare mentions:
Cloudflare reuses open TCP connections for up to 15 minutes (900 seconds) after the last HTTP request.
As far as I can tell, Go has no setting that caps the total lifetime of a connection (`IdleConnTimeout` on `http.Transport` only applies to connections that sit idle), but we may want to add a limit on how long the Runner can reuse an existing HTTP connection, forcing periodic reconnections. We may also want to upstream this change to Go. This would allow the problem in gitlab-com/gl-infra/production#17265 (closed) to heal automatically.