Regenerate docker-machine certs during gitlab-runner upgrades
Task
Opportunistically regenerate the CA and client certificates used by docker-machine on our CI Runner Manager hosts to provision VMs. This avoids a race condition that can corrupt the certs if they expire naturally while the host is in service.
We deploy the gitlab-runner service more often than the certs expire, and that planned downtime provides an opportunity when no other docker-machine processes should be running, hence avoiding the race condition.
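The ordering described above can be sketched as follows. This is only an illustration of the sequencing, not the actual deployment tooling: the four callables (`stop_runner`, `docker_machine_running`, `regenerate_certs`, `start_runner`) are hypothetical hooks that the real upgrade procedure would have to supply.

```python
def regenerate_during_upgrade(stop_runner, docker_machine_running,
                              regenerate_certs, start_runner):
    """Regenerate certs only inside the planned downtime window.

    All four arguments are hypothetical callables supplied by the caller.
    Returns True if the certs were regenerated, False if it was skipped.
    """
    stop_runner()  # planned downtime begins; no new jobs are pulled
    try:
        # Regenerating while another docker-machine process is alive
        # would reintroduce the race we are trying to avoid, so skip.
        if docker_machine_running():
            return False
        regenerate_certs()  # exactly one writer, so no overwrite race
        return True
    finally:
        start_runner()  # always bring the service back up
```

The `finally` clause reflects one design choice worth stating: a failed or skipped regeneration should not leave the runner stopped, since the existing certs remain usable until they expire.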
Background
Currently our CI Runner Manager hosts use docker-machine to provision VMs for running CI jobs in docker containers. The manager host acts as a CA and signs its own client cert as well as the per-VM server cert. The CA and client certs have a 3-year lifespan, and when they expire, the next invocation of docker-machine create will generate a fresh pair of certs. However, in a high-throughput environment like GitLab.com, it is very likely that multiple docker-machine processes will concurrently try to generate new certs, each one overwriting the other's work. This can lead to mismatched or corrupt CA and client certs. When that occurs (see production#1609 (comment 283265339)), the host running docker-machine can still provision new VMs, but those VMs reject all attempts to authenticate to their dockerd daemon using the damaged or untrusted client cert. Consequently, every job the manager host pulls fails to run.
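To make the failure mode concrete, here is a toy model of the overwrite race. It does not call docker-machine or generate real certificates: the strings stand in for cert contents, the file names are illustrative labels, and the write schedules are assumed interleavings chosen to show how two processes that each write a CA and then a client cert can leave a mismatched pair on disk.

```python
def fresh_pair(proc):
    """Model the CA and client cert that one process would generate."""
    return {
        "ca.pem": f"CA[{proc}]",
        "cert.pem": f"client signed by CA[{proc}]",
    }


def write_in_order(schedule):
    """Apply (process, filename) writes in order to a shared cert store."""
    store = {}
    for proc, filename in schedule:
        store[filename] = fresh_pair(proc)[filename]  # last writer wins
    return store


def client_trusted(store):
    """The client cert is trusted only if the stored CA signed it."""
    return store["cert.pem"] == f"client signed by {store['ca.pem']}"


# Process A finishes before B starts: the final pair is consistent.
serial = [("A", "ca.pem"), ("A", "cert.pem"),
          ("B", "ca.pem"), ("B", "cert.pem")]

# A and B overlap: B's CA survives, but A's client cert is written last.
interleaved = [("A", "ca.pem"), ("B", "ca.pem"),
               ("B", "cert.pem"), ("A", "cert.pem")]

print(client_trusted(write_in_order(serial)))       # True
print(client_trusted(write_in_order(interleaved)))  # False
```

In the interleaved case the stored CA and client cert come from different processes, which corresponds to the mismatched state described above: VMs set up to trust the stored CA would reject the client cert.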