American Family - RCA request for Docker Machine v23 incident
Summary
Context
Ostensibly this is a Docker incident and not a GitLab incident. However, the customer has requested an RCA, and I think it is fair to outline the facts and let the chips fall where they may. The customer is requesting formal documentation around the "incident" to provide to their Incident Response Team.
The customer was pulling the latest version of Docker Machine rather than the GitLab-released version, which caused their SSH failures.
Here is the related issue
Here is the related support ticket
Service(s) affected: Runner auth failure to Docker Engine due to missing certs directory (`/etc/docker`)
Team attribution:
Minutes downtime or degradation: ~2.5 hours
Impact & Metrics
| Question | Answer |
|---|---|
| What was the impact? | Docker Machine would fail to create VMs when starting CI jobs due to the release of Docker Engine 23.0.0 (https://docs.docker.com/engine/release-notes/23.0/#2300) |
| Who was impacted? | Self-hosted users who would pull the latest Docker version automatically when starting a new Docker Machine-managed VM |
| How did this impact customers? | Inability to run CI jobs that required new VMs, because those VMs could not be provisioned |
| How many attempts were made to access? | N/A |
| How many customers were affected? | N/A |
| How many customers tried to access? | N/A |
Detection & Response
| Question | Answer |
|---|---|
| When was the incident detected? | 2023-02-02 2:20 UTC |
| How was the incident detected? | User report at #29594 (moved) |
| Did alarming work as expected? | Relevant people were brought into the incident in under 2 hours |
| How long did it take from the start of the incident to its detection? | Docker 23 was released on 2023-02-01 (https://docs.docker.com/engine/release-notes/23.0/#2300) - one day |
| How long did it take from detection to remediation? | 3 hours |
| What steps were taken to remediate? | A fix was issued that creates the needed directory structure if it's missing: https://gitlab.com/gitlab-org/ci-cd/docker-machine/-/releases/v0.16.2-gitlab.19 |
| Were there any issues with the response? | N/A |
MR Checklist
Consider these questions if a code change introduced the issue.
| Question | Answer |
|---|---|
| Was the MR acceptance checklist marked as reviewed in the MR? | |
| Should the checklist be updated to help reduce chances of future recurrences? If so, who is the DRI to do so? | |
Timeline
2023-02-02
- 2:20 UTC - Issue was reported
- 4:45 UTC - Fix for our Docker Machine fork was issued
Root Cause Analysis
On 2023-02-01, Docker released Docker Engine version 23.0.0.
One of the changes in that release was to no longer create the `/etc/docker` directory if it doesn't exist at process start. Before that release, this directory was always (re)created by the Docker Engine process. While the optional files that can be placed in this directory are still supported, the existence of the directory itself became fully optional. Incidentally, Docker later treated that change as a bug and fixed it (for the DEB and RPM packaged releases of Docker) in version 23.0.1, released two days later.
Docker Machine - which in our case is used by Runner to autoscale the VMs where jobs are executed - is responsible for installing Docker Engine (an optional step) and provisioning the VM (which always happens) to make it usable for the Docker Client (the Runner process itself in our case). One of the steps during provisioning is to generate and install a TLS authentication certificate and key, which are then used to authenticate calls to the Docker Engine API. Docker Machine expects the `/etc/docker` directory to exist and tries to write files there during provisioning.
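For illustration only, here is a minimal Go sketch of why such a write fails when the directory is absent. It is not the actual Docker Machine provisioner code (the real provisioner transfers the files to the remote VM rather than writing to the local filesystem); the function and file names are assumptions made for the example.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// copyCertToHost mimics the shape of a provisioning step that writes TLS
// material into /etc/docker. Names are illustrative, not Docker Machine code.
func copyCertToHost(name string, data []byte) error {
	dst := filepath.Join("/etc/docker", name)
	// os.WriteFile does not create missing parent directories, so when
	// /etc/docker is absent this returns an ENOENT ("no such file or
	// directory") error - the failure mode seen with Docker Engine 23.0.0.
	return os.WriteFile(dst, data, 0o600)
}

func main() {
	if err := copyCertToHost("server-cert.pem", []byte("dummy")); err != nil {
		fmt.Println("provisioning step failed:", err)
	}
}
```

However the file is actually written, the result is the same: without the parent directory in place, the write cannot succeed.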
The incident affected only users who used VM images that don't already have Docker Engine installed. This is why the problem was not detected on SaaS Linux Runners, where we also use the Docker Machine executor: as Docker Engine is preinstalled on the images we use to boot up the VMs, Docker Machine skips the installation step.
For users who don't have Docker Engine preinstalled in their image of choice, Docker Machine initiates the installation step. Docker Machine, which was written a very long time ago, simply expected the `/etc/docker` directory to be there for most of the platforms detected by its provisioning mechanism, because back then it was always created. Some of the platform-specific provisioners executed an additional `mkdir -p /etc/docker` command, but most didn't.
When the new Docker Engine version was released and became available in the official release channels, it was automatically chosen by the provisioner for installation. This means that, for such configurations, Docker Engine was installed without the `/etc/docker` directory. When Docker Machine tried to save the TLS certificate and key a few steps later, provisioning failed with a `no such file or directory` error.
From this moment, there were only four ways this could be fixed:
- Use a VM image with Docker Engine already installed (this is what we do on SaaS Linux Runners).
- Pin the version of Docker Engine to be installed in the Docker Machine configuration (which was quickly suggested by community members as a confirmed workaround).
- Update the Docker Machine code to not expect `/etc/docker` to exist and always (re)create it before use (which became the fix we quickly added to our Docker Machine fork; a rough sketch follows this list).
- Restore the creation of `/etc/docker` on Docker Engine installation (which Docker itself eventually treated as a bug fix and released in patch version 23.0.1).
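As a rough illustration of the third option, here is a minimal Go sketch of the shape of the fix shipped in our fork (v0.16.2-gitlab.19, linked above): create the directory if it is missing before writing. Function and file names are illustrative assumptions, not the fork's actual code.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// writeCertToHost sketches the fix: (re)create /etc/docker before writing the
// TLS material, instead of assuming the Docker Engine installation left the
// directory in place. Names and structure are illustrative only.
func writeCertToHost(name string, data []byte) error {
	// MkdirAll is a no-op when the directory already exists, so it is safe
	// to run unconditionally before every write.
	if err := os.MkdirAll("/etc/docker", 0o755); err != nil {
		return err
	}
	return os.WriteFile(filepath.Join("/etc/docker", name), data, 0o600)
}

func main() {
	if err := writeCertToHost("server-cert.pem", []byte("dummy")); err != nil {
		fmt.Println("write failed:", err)
	}
}
```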
What went well
- Issue was quickly identified
- Fix was quickly issued and released (for our Docker Machine fork)
What can be improved
We could advocate more strongly for users to lock their Docker versions and upgrade them deliberately, rather than getting the latest and greatest by default.
As has been proven multiple times in the past, also with Docker releases (though with problems not related to Docker Machine), having the latest version installed automatically, without prior tests, on production-critical setups is risky. There is always a risk that a new version brings new, unknown regressions. These are often detected and fixed quickly, but that may still be enough to break a user's setup at a critical moment.
By using a locked version of the tooling and making conscious decisions about upgrading, based on prior tests and confirmation that things still work, users can save themselves a lot of trouble.
We started advocating for locked versions of the container images used in GitLab CI/CD job definitions after some popular images caused a similar class of "unknown regression released in `:latest`" problems. As long as we depend on Docker Machine as a critical part of Runner's autoscaling, we should encourage users to use their own VM images, based on the distributions they like, but with some tools - especially Docker Engine - already installed.
Corrective actions
We should update the Docker Machine executor documentation to strongly suggest using specially prepared VM images with Docker Engine already preinstalled.
Other than that, no corrective action is planned.
Given that the Docker environment and ecosystem keep evolving and Docker Machine was abandoned upstream five years ago, we can and should expect random failures to happen in the future. We don't "own" Docker Engine and can't proactively handle all possible changes.
Our main goal and focus is to create a new autoscaling mechanism for Runner that will replace Docker Machine, making Docker Machine-based execution redundant and deprecated at some point, and giving us the ability to control the evolution of the new autoscaling engine.