
2021-02-22: Pipelines failing with mkdir /builds: read-only file system

Note: In some cases we need to redact information from public view. We only do this in a limited number of documented cases. This might include the summary, timeline, or any other bits of information, as laid out in our handbook page. Any confidential data will be in a linked issue, only visible internally. By default, all information we can share will be public, in accordance with our transparency value.

Summary

Pipelines seem to be failing with mkdir /builds: read-only file system errors.

Examples:

Workaround

Re-run the job if you see:

ERROR: Job failed (system failure): prepare environment: Error response from daemon: error while creating mount source path '/builds/gitlab-org-forks': mkdir /builds: read-only file system (exec.go:57:0s). Check https://docs.gitlab.com/runner/shells/index.html#shell-profile-loading for more information
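
If retrying from the UI is impractical (for example when many jobs are affected), a job can also be retried through the GitLab Jobs API. A minimal sketch, assuming a personal access token with API scope; the project and job IDs are placeholders:

    # Retry a single failed job via the GitLab Jobs API (IDs and token are placeholders)
    curl --request POST \
         --header "PRIVATE-TOKEN: <your_access_token>" \
         "https://gitlab.com/api/v4/projects/<project_id>/jobs/<job_id>/retry"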

Timeline

All times UTC.

View recent production deployment and configuration events (internal only)

2021-02-22

Corrective Actions

There is a main corrective action issue listing out other related corrective actions: https://gitlab.com/gitlab-com/gl-infra/production/-/issues/3685



Incident Review


Summary

  1. Service(s) affected: Service::CI Runners
  2. Team attribution: group::runner
  3. Time to detection: 4 days, 14 hours
  4. Minutes downtime or degradation: 4 days, 22 hours

Metrics

Customer Impact

  1. Who was impacted by this incident? (i.e. external customers, internal customers)
    1. Every project using jobs tagged with gitlab-org, which covers most pipelines run for the gitlab-org group, and any forks of those projects.
  2. What was the customer experience during the incident? (i.e. preventing them from doing X, incorrect display of Y, ...)
    1. Developers were prevented from getting their code tested and merged.
  3. How many customers were affected?
    1. 0
  4. If a precise customer impact number is unknown, what is the estimated impact (number and ratio of failed requests, amount of traffic drop, ...)?
    1. No customers were affected; only engineers at GitLab Inc. and community contributors were impacted.

What were the root causes?

"5 Whys"

Wrong path for the /builds/gitlab-org-forks volume

  1. Jobs were failing because Docker was unable to mount the /builds/gitlab-org-forks:/builds/gitlab-org-forks volume (see the configuration sketch after this list).
  2. The volume was not mountable because the / filesystem in the new VM image is read-only.
  3. The mount path pointed at the read-only filesystem because it was copied 1:1 from a previous configuration that used a different OS.
  4. The configuration was not updated properly because the path was missed by the author. Other paths discovered during the prmX deployment in past weeks were configured properly.
  5. The path was missed because it was simply forgotten. The read-only nature of the Google COS distribution is new for us, and while working on the configuration we detected several places where it interfered. We thought all of them were caught when we did the test deployment on prmX some time ago, but the /builds/gitlab-org-forks mount option is specific to the gsrmX runner managers, and while describing all the steps in https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/12504 it was overlooked.
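
For illustration, the volume in question is defined on the runner manager in config.toml under the Docker executor settings. A minimal sketch, not the production configuration; the sections shown are standard GitLab Runner settings, but the surrounding values are assumptions:

    # Illustrative config.toml fragment for a docker+machine runner manager.
    [[runners]]
      executor = "docker+machine"
      [runners.docker]
        # The part before the colon is the host-side source path created on the job VM.
        # On Google COS the root filesystem is read-only, so Docker cannot create
        # /builds/gitlab-org-forks there; the source has to live on a writable mount
        # (for example somewhere under /var) or be replaced by a named Docker volume.
        volumes = ["/builds/gitlab-org-forks:/builds/gitlab-org-forks"]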

Wrong Docker Machine version

  1. Using the Google COS image and its configuration requires a Docker Machine version with the support for it that we implemented some time ago. The new version was deployed on the prmX runners before the CoreOS -> Google COS configuration change was applied; it was not deployed here (a version-check sketch follows this list).
  2. The new version was not deployed here because it was forgotten by the author of the change.
  3. It was forgotten because of the false assumption that "we will remember about that when moving the rollout further".
  4. The assumption was made because when the rollout started "it was obvious that we need it", and instead of deploying the new version of Docker Machine across the managed runners fleet after it was confirmed to work on prmX, the plan was to "follow the steps that we've made".
  5. When the final rollout plan at https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/12504 was written up, the fact that the other runner managers did not have the required Docker Machine version was forgotten and missed from the checklists.
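
A rough illustration of the missing pre-flight check; the expected version is whatever was validated on the prmX managers, so the comparison target is an assumption:

    # On each runner manager, confirm the Docker Machine binary before switching
    # that manager to the Google COS image, and compare the output with the
    # version already validated on prmX.
    docker-machine version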

Incident Response Analysis

  1. How was the incident detected?

    After the company and community contributors started waking up into a new working week, jobs on the gsrmX runner managers started failing. Once the consistent pattern of failures was noticed, the incident was reported through Slack.

    Apart from that, we had a second incident caused by exactly the same changes (#3672 (closed)), which was discovered yesterday evening and also discussed on Slack.

  2. How could detection time be improved?

    1. As noted at #3674 (comment 514526063), this specific cause could have been caught if we had some smart alerting based on an existing metric (an illustrative alert sketch follows this list).
    2. Have E2E tests for specific features each runner type provides.
  3. How was the root cause diagnosed?

    1. #3672 (closed)
      1. Looking at gitlab-runner process logs: #3672 (comment 513794147)
      2. Looking at `docker-machine version`: #3672 (comment 513794239)
    2. #3674 (closed)
      1. Looking at the volume mounts configured in config.toml on gsrmX: #3674 (comment 513896486)
  4. How could time to diagnosis be improved?

    1. Have Kibana runner error logs easily reachable from a dashboard.
    2. Familiarity with Google COS
    3. A change management issue to make it clear a new OS is being rolled out
    4. A docker-machine version mismatch alert: gitlab-org/gitlab-runner#27607 (closed)
  5. How did we reach the point where we knew how to mitigate the impact?

    1. Having someone from group::runner come online to help out with the configuration and add context on the setup.
    2. Understanding the directory structure of Google COS
  6. How could time to mitigation be improved?

    1. Have a clear rollback strategy connected with the introduced change and known to the on-call SRE.
    2. Have a quick way to deploy and enforce new configuration for the CI Runners service.
  7. What went well?
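
Returning to the detection-time improvements above, one option would be alerting on GitLab Runner's exported job-failure metrics. A rough sketch of a Prometheus alerting rule, assuming the gitlab_runner_failed_jobs_total metric (with its failure_reason label) is scraped from the runner managers; the threshold, duration, and group name are placeholders, not the production rules:

    # Illustrative alerting rule only; threshold and routing are assumptions.
    groups:
      - name: ci-runners-system-failures-example
        rules:
          - alert: RunnerSystemFailuresElevated
            expr: sum by (instance) (rate(gitlab_runner_failed_jobs_total{failure_reason="runner_system_failure"}[5m])) > 0.1
            for: 15m
            annotations:
              summary: "Elevated rate of runner_system_failure job failures on {{ $labels.instance }}"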

Post Incident Analysis

  1. Did we have other events in the past with the same root cause?

    Recently we had some errors after our initial switch from CoreOS to Google Container-Optimized OS (the main reason for the change that failed over the weekend), but those were expected, as we had never worked with this OS directly and assumed that "some things will go wrong". This time we expected that all problems were behind us and things would go well. The incident was caused by overlooking two things in the configuration that should have been handled but were not obvious.

    So I'd say no, we didn't have other events caused by the same root cause, or to be precise: not for quite a long time. It was a problem of wrong configuration when a lot of things needed to be shipped together. We don't do such updates often in the CI Runners area, so there are not many occasions to create such a problem.

  2. Do we have existing backlog items that would've prevented or greatly reduced the impact of this incident?

    This falls into a bucket of multiple problems caused by the current deployment mechanism for the GitLab.com shared runners fleet. Updating the deployment and configuration mechanism is something we've been looking at for a long time, and we're planning to start working on it soon. An issue that describes the initial plan (which needs a refresh with what we've updated and/or learned in past years) can be found at https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/4813.

    Having a fully automated deployment mechanism, where configuration changes can be introduced within minutes instead of hours and can be easily reverted within minutes, would both allow us to find out about the problem right after the configuration change was merged (before the runner users signed off) and, once the incident was detected, make the revert a one-call operation.

  3. Was this incident triggered by a change (deployment of code or change to infrastructure)? If yes, link the issue.

    The incident was caused by overlooking two things in the configuration while working on the CoreOS -> Google COS rollout. The direct cause of the failure was introduced while applying https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/12504#deploy-google-cos-to-gitlab-shared-runners-manager-xgitlabcom and https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/12504#prepare-for-merging

Lessons Learned

Copied from the notes we took during the post-incident retrospective:

  1. Create a rollout plan for the docker-machine version upgrade as a separate issue/todo list so we don't forget to upgrade the whole fleet, as happened in #3672 (closed).
    1. Create a runbook for the docker-machine rollout to help folks understand what needs to be done to upgrade the docker-machine version and that no downtime is required.
  2. Create a set of E2E tests for each type of runner manager fleet, including platform-specific tests.
    1. For example, for gsrmX, that a pipeline from a fork can run successfully.
  3. Tag each runner (or set of runners) with stages so we can do a rolling deployment.
  4. When a complex update is needed, do a rolling deployment for each runner manager type, for example:
    1. gsrm3 - First to deploy, let it bake for 1-2 hours.
    2. gsrm4-5 - Deploy to this fleet, let it bake for 30-40 minutes.
    3. Roll out to 100% of the fleet.
  5. Start using https://about.gitlab.com/handbook/engineering/infrastructure/change-management/ for CI Runners-related changes. Make the progress and the rollback strategy clear, so that if problems arise the on-call SRE can easily roll back without development team assistance.

Guidelines

Resources

  1. If the Situation Zoom room was utilised, the recording will be automatically uploaded to the Incident room Google Drive folder (private).

Incident Review Stakeholders

  1. @brentnewton
  2. @erushton
  3. @tmaczukin
  4. @steveazz
  5. @grzesiek
  6. @tkuah
  7. @craig
  8. @igorwwwwwwwwwwwwwwwwwwww
  9. @alejandro