2021-02-22: Pipelines failing with mkdir /builds: read-only file system
Note: In some cases we need to redact information from public view. We only do this in a limited number of documented cases, which might include the summary, timeline, or other bits of information, as laid out in our handbook page. Any such confidential data will be in a linked issue, visible only internally. By default, all information we can share will be public, in accordance with our transparency value.
Summary
Pipelines seem to be failing with `mkdir /builds: read-only file system` errors.
Examples:
- https://gitlab.com/gitlab-org/gitlab/-/jobs/1045300725
- https://gitlab.com/gitlab-org/gitlab/-/pipelines/259518131
- https://gitlab.com/gitlab-org/gitlab/-/pipelines/259537573
Workaround
Re-run the job if you see:

```
ERROR: Job failed (system failure): prepare environment: Error response from daemon: error while creating mount source path '/builds/gitlab-org-forks': mkdir /builds: read-only file system (exec.go:57:0s). Check https://docs.gitlab.com/runner/shells/index.html#shell-profile-loading for more information
```
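Retrying can also be scripted against the GitLab REST API (`POST /projects/:id/jobs/:job_id/retry`). A minimal sketch; the project ID (assumed here to be gitlab-org/gitlab's) and the token placeholder are illustrative:

```python
import urllib.request

def retry_job_request(base_url, project_id, job_id, token):
    """Build the POST request that retries a CI job via the GitLab REST API."""
    url = f"{base_url}/api/v4/projects/{project_id}/jobs/{job_id}/retry"
    return urllib.request.Request(url, method="POST",
                                  headers={"PRIVATE-TOKEN": token})

# 278964 is assumed to be the gitlab-org/gitlab project ID;
# the job ID comes from the first example link above.
req = retry_job_request("https://gitlab.com", 278964, 1045300725, "<your-token>")
print(req.full_url)
```

Sending the request with `urllib.request.urlopen(req)` requires a personal access token with `api` scope.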
Timeline
All times UTC.
View recent production deployment and configuration events (internal only)
2021-02-22
- 00:33 - @seanarnold declares incident in Slack.
- 00:44 - @craig confirms that a manual retry on one of the affected jobs completed successfully
- 01:02 - @craig engages dev on-call
- 01:34 - @craig merges chef-repo!5034 to revert changes applied for #3672 (closed)
- 01:40 - @krasio confirms the issue is scoped to GitLab shared runners only
- 01:57 - @craig merges chef-repo!5035 to reapply changes from chef-repo!5034, since the revert did not have any effect on this issue
- 02:08 - @brentnewton joins incident call
- 02:21 - @tkuah updates incident with a workaround (manually retrying failed pipelines/jobs)
- 02:39 - @brentnewton downgrades incident to severity 3
- 02:40 - @brentnewton adds @tmaczukin, @steveazz, @grzesiek, and @erushton to incident channel
- 02:47 - @tkuah correlates intermittently successful retries with whether jobs launch on `private-runners-manager-XX.gitlab.com` vs `gitlab-shared-runners-manager-XX.gitlab.com`
- 03:40 - @stanhu suggests disabling the shared `/builds` volume mount as a potential mitigation
- 06:04 - @steveazz and @craig change the `/builds` volume mount to use a writable directory on the host in chef-repo!5037
- 06:27 - @craig and @steveazz change the host path from `/tmp` to `/mnt/stateful_partition` via chef-repo!5038
- 06:49 - @steveazz reports jobs failing to execute scripts
- 07:00 - @craig hands off to @igorwwwwwwwwwwwwwwwwwwww as incoming EOC
- 07:07 - @craig confirms the failures are due to `noexec` mount options on `/mnt/stateful_partition`
- 08:04 - @steveazz and @igorwwwwwwwwwwwwwwwwwwww test `/var/lib/docker` as a potential filesystem for the underlying host mount that does have the `exec` option
- 08:12 - @steveazz merges change to use `/var/lib/docker` for the host mount in chef-repo!5039
- 08:22 - initial testing complete, proceeding to apply changes to remaining `gsrmX` nodes
- 08:30 - incident mitigated
Corrective Actions
There is a main corrective action issue listing out other related corrective actions: https://gitlab.com/gitlab-com/gl-infra/production/-/issues/3685
Incident Review
Summary
- Service(s) affected: Service::CI Runners
- Team attribution: group::runner
- Time to detection: 4 days, 14 hours
- Minutes downtime or degradation: 4 days, 22 hours
Metrics
Customer Impact
- Who was impacted by this incident? (i.e. external customers, internal customers)
  - Every project using jobs tagged with `gitlab-org`, which is where most pipelines for the gitlab-org group run, and any forks of those projects.
- What was the customer experience during the incident? (i.e. preventing them from doing X, incorrect display of Y, ...)
  - Developers were prevented from merging their code and getting it tested.
- How many customers were affected?
  - 0
- If a precise customer impact number is unknown, what is the estimated impact (number and ratio of failed requests, amount of traffic drop, ...)?
  - No external customers were affected; only engineers at GitLab Inc. and community contributors.
What were the root causes?
Wrong path for the `/builds/gitlab-org-forks` volume
- Jobs were failing because Docker was unable to mount a `/builds/gitlab-org-forks:/builds/gitlab-org-forks` volume.
- The volume was not mountable because the `/` filesystem in the new VM image is read-only.
- The mounting path pointed to the read-only filesystem because it was copied 1:1 from a previous configuration that used a different OS.
- The configuration was not updated properly because the path was missed by the author. Other paths discovered during the `prmX` deployment in past weeks were configured properly.
- The path was missed because it was forgotten. The read-only nature of the Google COS distribution is new for us, and while working on the configuration we detected several places where it interfered. We thought all of them were caught when we did a test deployment on `prmX` some time ago, but the `/builds/gitlab-org-forks` mounting option is specific to the `gsrmX` runner managers, and while describing all the steps in https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/12504 it was simply overlooked.
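The shape of the change involved can be sketched as a `gitlab-runner` `config.toml` fragment. The `[runners.docker]` `volumes` key is the real runner setting; the exact host subpath under `/var/lib/docker` is an assumption, not the literal production value:

```toml
# Illustrative gitlab-runner config.toml fragment (docker executor).
# The original 1:1 mapping fails when "/" on the host is read-only:
#   volumes = ["/builds/gitlab-org-forks:/builds/gitlab-org-forks"]
[runners.docker]
  # Mitigation shape: back the container path with a writable,
  # exec-capable host directory (exact subpath is an assumption).
  volumes = ["/var/lib/docker/builds/gitlab-org-forks:/builds/gitlab-org-forks"]
```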
Wrong Docker Machine version
- Using the Google COS image and its configuration requires a Docker Machine version with support for it, which we implemented some time ago. The new version was deployed on the `prmX` runners before the CoreOS -> Google COS configuration change, but it was not deployed here.
- The new version was not deployed here because it was forgotten by the author of the change.
- It was forgotten because it was falsely assumed that "we will remember about that when moving the rollout further".
- The assumption was made because when the rollout started "it was obvious that we need it". Instead of deploying the new version of Docker Machine across the managed runners fleet after confirming it worked on `prmX`, the plan was to "follow the steps that we've made".
- When the final rollout plan at https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/12504 was written, it was forgotten that we did not have the required version of Docker Machine on the other runner managers, so it was missed from the checklists.
Incident Response Analysis
- How was the incident detected?
  - After the company and community contributors started to wake up into a new working week, jobs on the `gsrmX` runner managers started failing. When the consistency of the failures was discovered, the incident was reported through Slack.
  - Apart from that, we had a second incident caused by exactly the same changes (#3672 (closed)), which was discovered the previous evening and also discussed on Slack.
- How could detection time be improved?
  - As noted at #3674 (comment 514526063), this specific cause could have been caught if we had smart alerting based on an existing metric.
  - Have E2E tests for the specific features each runner type provides.
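Beyond metric-based alerting, a cheap preflight probe that actually tries to create a file under each path the runner will mount would catch a read-only filesystem that a plain directory-exists check misses. A minimal sketch; running such a probe on runner-manager hosts is a suggestion, not an existing runner feature:

```python
import tempfile

def is_writable_dir(path):
    """True if a file can actually be created under `path` (catches read-only mounts)."""
    try:
        with tempfile.NamedTemporaryFile(dir=path):
            return True
    except OSError:
        return False

print(is_writable_dir(tempfile.gettempdir()))  # True on a normal system
```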
- How was the root cause diagnosed?
  - #3672 (closed)
    - Looking at `gitlab-runner` process logs: #3672 (comment 513794147)
    - Looking at the `docker-machine` version: #3672 (comment 513794239)
  - #3674 (closed)
    - Looking at volume mounts on `gsrmX` inside of the `config.toml`: #3674 (comment 513896486)
How could time to diagnosis be improved?
- Have easy kibana logs inside of dashboard to look at runner error logs.
- Familiarity with Google COS
- Change management issue to make it clear new OS is being rolled out
-
docker-machine
version mismatch alert gitlab-org/gitlab-runner#27607 (closed)
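The version mismatch alert proposed in gitlab-org/gitlab-runner#27607 could amount to comparing the version each manager reports against the expected one. A sketch; the hostnames and version strings below are illustrative, not real fleet data:

```python
def find_version_mismatches(fleet_versions, expected):
    """Return hosts whose reported docker-machine version differs from the expected one."""
    return sorted(host for host, version in fleet_versions.items()
                  if version != expected)

mismatched = find_version_mismatches(
    {"gsrm3": "v0.16.2-gitlab.2", "prm1": "v0.16.2-gitlab.4"},
    "v0.16.2-gitlab.4",
)
print(mismatched)  # ['gsrm3'] — hosts that still need the upgrade
```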
- How did we reach the point where we knew how to mitigate the impact?
  - Having someone from group::runner come online to help with configuration and add context on the setup.
  - Understanding the directory structure of Google COS.
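The `noexec` discovery during mitigation boils down to reading the mount options for a mount point. A small sketch of that check, parsing `/proc/mounts`-style output (the sample line is illustrative):

```python
def mount_options(mounts_text, mount_point):
    """Return the option list for `mount_point` from /proc/mounts-style text."""
    for line in mounts_text.splitlines():
        fields = line.split()
        # fields: device, mount point, fstype, options, dump, pass
        if len(fields) >= 4 and fields[1] == mount_point:
            return fields[3].split(",")
    return []

sample = "/dev/sda1 /mnt/stateful_partition ext4 rw,nosuid,nodev,noexec,relatime 0 0"
print("noexec" in mount_options(sample, "/mnt/stateful_partition"))  # True
```

On a live host the same check would read `open("/proc/mounts").read()` instead of the sample string.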
- How could time to mitigation be improved?
  - Have a clear rollback strategy connected to the introduced change and known to the on-call SRE.
  - Have a quick way to deploy and enforce new configuration for the CI Runners service.
- What went well?
Post Incident Analysis
- Did we have other events in the past with the same root cause?

  Recently we've had some errors after our initial switch from CoreOS to Google Container-Optimized OS (the main reason for the change that failed over the weekend), but these were expected, as we had never worked with this OS directly and assumed that "some things will go wrong". This time we expected that all problems were behind us and things would go well. The incident was caused by overlooking two things in the configuration that should have been handled but were not obvious.

  So no, we didn't have other events caused by the same root cause; to be precise, not for quite a long time. It was a problem of wrong configuration when a lot of things needed to be shipped together. We don't do such updates often in the CI Runners area, so there are not many occasions to create such a problem.
- Do we have existing backlog items that would've prevented or greatly reduced the impact of this incident?

  This falls into a bucket of multiple problems caused by the current deployment mechanism for the GitLab.com shared runners fleet. Updating the deployment and configuration mechanism is something we've been looking at for a long time, and we're planning to start working on it soon. An issue describing the initial plan (which needs a refresh with what we've updated and/or learned in past years) can be found at https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/4813.

  Having a deployment mechanism that is fully automated, where configuration changes can be introduced within minutes instead of hours and easily reverted within minutes, would both let us find out about the problem just after the configuration change was merged (before the runner users signed off) and, once the incident was detected, make the revert a one-call operation.
- Was this incident triggered by a change (deployment of code or change to infrastructure)? If yes, link the issue.

  The incident was caused by overlooking two things in configuration while working on the CoreOS -> Google COS rollout. The direct cause of the failure was introduced while applying https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/12504#deploy-google-cos-to-gitlab-shared-runners-manager-xgitlabcom and https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/12504#prepare-for-merging
Lessons Learned
Copying from the notes we took during the post-incident retrospective:
- Create a rollout plan for the docker-machine version upgrade as a separate issue/todo list so we don't forget to update all of the fleet, as we did in #3672 (closed)
- Create a runbook for `docker-machine` rollout to help folks understand what needs to be done to upgrade the `docker-machine` version and that no downtime is required.
- Create a set of E2E tests for each type of runner manager fleet, including platform-specific tests
  - For example, for `gsrmX`: that a pipeline from a fork can run successfully
- Tag each set of runners with stages, so we can do a rolling deployment.
- When a complex update is needed, do a rolling deployment for each runner manager type, for example:
  - `gsrm3` - first to deploy; let it bake for 1-2 hours
  - `gsrm4-5` - deploy to this fleet; let it bake for 30-40 minutes
  - Roll out to 100% of the fleet.
- Start using https://about.gitlab.com/handbook/engineering/infrastructure/change-management/ for CI Runners related changes. Make it clear what the progress is and what the rollback strategy is, so that if problems arise the on-call SRE can easily roll back without development team assistance.
Resources
- If the Situation Zoom room was utilised, recording will be automatically uploaded to Incident room Google Drive folder (private)