2021-02-22: Pipelines failing with mkdir /builds: read-only file system
Note: In some cases we need to redact information from public view. We only do this in a limited number of documented cases, which might include the summary, timeline or any other bits of information, as laid out in our handbook page. Any of this confidential data will be in a linked issue, only visible internally. By default, all information we can share will be public, in accordance with our transparency value.
Pipelines seem to be failing with `mkdir /builds: read-only file system` errors.
Re-run the job if you see:
```
ERROR: Job failed (system failure): prepare environment: Error response from daemon: error while creating mount source path '/builds/gitlab-org-forks': mkdir /builds: read-only file system (exec.go:57:0s). Check https://docs.gitlab.com/runner/shells/index.html#shell-profile-loading for more information
```
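If many jobs are affected, they can also be retried through the GitLab REST API instead of the UI. A minimal sketch is below; the project/job/pipeline IDs and the token variable are placeholders, not values from this incident.

```shell
# Hedged sketch: retry a single failed job, or all failed jobs in a pipeline,
# via the GitLab REST API. PROJECT_ID, JOB_ID, PIPELINE_ID and GITLAB_TOKEN
# are placeholders.
curl --request POST \
     --header "PRIVATE-TOKEN: ${GITLAB_TOKEN}" \
     "https://gitlab.com/api/v4/projects/${PROJECT_ID}/jobs/${JOB_ID}/retry"

curl --request POST \
     --header "PRIVATE-TOKEN: ${GITLAB_TOKEN}" \
     "https://gitlab.com/api/v4/projects/${PROJECT_ID}/pipelines/${PIPELINE_ID}/retry"
```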
All times UTC.
View recent production deployment and configuration events (internal only)
0033- @seanarnold declares incident in Slack.
0044- @craig confirms that a manual retry on one of the affected jobs completed successfully
0102- @craig engages dev on-call
0134- @craig merges chef-repo!5034 to revert changes applied for #3672 (closed)
0140- @krasio confirms the issue is scoped to the gitlab shared runners only
0157- @craig merges chef-repo!5035 to reapply changes from chef-repo!5034 since the revert did not have any effect on this issue
0208- @brentnewton joins incident call
0221- @tkuah updates incident with a workaround (manually retrying failed pipelines/jobs)
0239- @brentnewton downgrades incident to
0240- @brentnewton adds @tmaczukin @steveazz @grzesiek and @erushton to incident channel
0247- @tkuah correlates intermittently successful retries to depend on whether jobs launch on
0340- @stanhu suggests disabling the shared `/builds` volume mount as a potential mitigation
0604- @steveazz and @craig change the `/builds` volume mount to use a writeable directory on the host in chef-repo!5037
0627- @craig and @steveazz change the host path from
0649- @steveazz reports jobs failing to execute scripts
0700- @craig hands off to @igorwwwwwwwwwwwwwwwwwwww as incoming EOC
0707- @craig confirms the failures are due to `noexec` mount options on
0804- @steveazz and @igorwwwwwwwwwwwwwwwwwwww testing `/var/lib/docker` as a potential filesystem for underlying host mount that does have
0812- @steveazz merges change to use `/var/lib/docker` for host mount in chef-repo!5039 (see the `noexec` sketch after this timeline)
0822- initial testing complete, proceeding to apply changes to remaining
0830- incident mitigated
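The 0649 and 0707 entries come down to mount options: a host directory can be writable yet still refuse to execute job scripts if it is mounted `noexec`, which is why the host path was eventually moved under `/var/lib/docker`. A minimal sketch, with illustrative paths rather than the exact production layout:

```shell
# On a Google COS host, several writable locations are mounted noexec, so a
# script copied there can be created but not executed:
cat > /home/test.sh <<'EOF'
#!/bin/sh
echo "hello from a job script"
EOF
chmod +x /home/test.sh
/home/test.sh      # fails with "Permission denied" when /home is mounted noexec

# Checking the mount options shows whether a candidate host path is usable:
mount | grep -E ' /home | /var/lib/docker '   # look for rw without noexec
```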
There is a main corrective action issue listing out other related corrective actions: https://gitlab.com/gitlab-com/gl-infra/production/-/issues/3685
- Service(s) affected:
- Team attribution:
- Time to detection: 4 days, 14 hours
- Minutes downtime or degradation: 4 days, 22 hours
Who was impacted by this incident? (i.e. external customers, internal customers)
- Every project using jobs tagged with `gitlab-org`, which is where most pipelines for the gitlab-org group run, and any forks of those projects.
- Every project using jobs tagged with
What was the customer experience during the incident? (i.e. preventing them from doing X, incorrect display of Y, ...)
- Preventing developers from merging and testing their code
How many customers were affected?
If a precise customer impact number is unknown, what is the estimated impact (number and ratio of failed requests, amount of traffic drop, ...)?
- No customers were affected; only engineers at GitLab Inc. and community contributors were impacted.
What were the root causes?
Wrong path for the
- Jobs were failing because Docker was unable to mount a
- The volume was not mountable because the `/` filesystem in the new VM image is read-only.
- The mounting path was pointing to the read-only filesystem as it was copied 1:1 from a previous configuration that used a different OS.
- Configuration was not updated properly because it was missed by the author. Other paths discovered during `prmX` deployment in past weeks were configured properly.
- The path was missed because it was forgotten. The read-only nature of the Google COS distribution is new for us, and while working on the configuration we detected several places where it was interfering. We thought all of them were caught when we did a test deployment on `prmX` some time ago. But the `/builds/gitlab-org-forks` mounting option is specific to the `gsrmX` runner managers, and while describing all the steps in https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/12504 it was simply forgotten (a short sketch of how the read-only root filesystem manifests follows this list).
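A minimal sketch of how this surfaces on a host with a read-only root filesystem; the commands and paths below are illustrative (based on the error message above), not taken from the production hosts:

```shell
# On Google COS the root filesystem is mounted read-only, so neither the host
# nor Docker can create a missing bind-mount source directly under /:
mount | grep ' / '        # the root mount shows the "ro" option
sudo mkdir /builds        # mkdir: cannot create directory '/builds': Read-only file system
docker run --rm -v /builds/gitlab-org-forks:/builds alpine true
# docker: Error response from daemon: error while creating mount source path
# '/builds/gitlab-org-forks': mkdir /builds: read-only file system.

# Pointing the host side of the volume at a writable location avoids the error;
# /var/lib/docker was the path eventually chosen (chef-repo!5039):
docker run --rm -v /var/lib/docker/builds/gitlab-org-forks:/builds alpine true
```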
Wrong Docker Machine version
- Usage of the Google COS image and its configuration requires a Docker Machine version with support for it, which we implemented some time ago. The new version was deployed on the `prmX` runners before deploying the CoreOS -> Google COS configuration change. It was not deployed here.
- The new version was not deployed here, as it was forgotten by the author of the change.
- It was forgotten because it was falsely assumed that "we will remember about that when moving the rollout further".
- The assumption was made because when the rollout was started "it was obvious that we need it". And instead of deploying the new version of Docker Machine across the managed runners fleet after it was confirmed to work on `prmX`, the plan was to "follow the steps that we've made".
- When the final rollout plan at https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/12504 was written, it was forgotten that we did not have the required version of Docker Machine on the other runner managers, and this step was missed from the checklists.
Incident Response Analysis
How was the incident detected?
After the company and community contributors started to wake up into a new working week, jobs on the `gsrmX` runner managers started failing. When the consistency of the failures was discovered, the incident was reported through Slack.
Apart from that, we had a second incident caused by exactly the same changes (#3672 (closed)), which was discovered yesterday evening and also discussed on Slack.
How could detection time be improved?
- As noted at #3674 (comment 514526063), this specific cause could have been caught if we had smart alerting based on an existing metric.
- Have E2E tests for specific features each runner type provides.
How was the root cause diagnosed?
How could time to diagnosis be improved?
How did we reach the point where we knew how to mitigate the impact?
- Having someone familiar with the setup come online to help out with the configuration and add context.
- Understanding the directory structure of Google COS
How could time to mitigation be improved?
- Have a clear rollback strategy connected with the introduced change and known to on-call SRE.
- Have a quick way to deploy and enforce new configuration for the
What went well?
Post Incident Analysis
Did we have other events in the past with the same root cause?
Recently we've had some errors after our initial switch from CoreOS to Google Container-Optimized OS (the main reason for the change that failed over the weekend), but these were expected, as we had never worked with this OS directly and assumed that "some things will go wrong". This time we expected that all problems were behind us and that things would go well. The incident was caused by overlooking two things in the configuration that should have been handled but were not obvious.
So I'd say no, we didn't have other events caused by the same root cause. Or, to be precise, not for quite a long time. It was a problem of wrong configuration when a lot of things need to be shipped together. We don't do such updates often in this area, so there are not many occasions to create such a problem.
Do we have existing backlog items that would've prevented or greatly reduced the impact of this incident?
This falls into a bucket of multiple problems caused by the current deployment mechanism for the GitLab.com shared runners fleet. Updating the deployment and configuration mechanism is something we've been looking at for a long time, and we're planning to start working on it soon. An issue that describes the initial plan (which needs a refresh with what we've updated and/or learned in the past years) can be found at https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/4813.
Having a deployment mechanism that is fully automated, where configuration changes can be introduced within minutes instead of hours and can be reverted just as quickly, would have let us find out about the problem right after the configuration change was merged (before runner users signed off), and once the incident was detected, the revert could have been a one-call operation.
Was this incident triggered by a change (deployment of code or change to infrastructure)? If yes, link the issue.
The incident was caused by overlooking two things in the configuration while working on the CoreOS -> Google COS rollout. The direct cause of the failure was introduced while applying https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/12504#deploy-google-cos-to-gitlab-shared-runners-manager-xgitlabcom and https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/12504#prepare-for-merging
- Create a rollout plan for the `docker-machine` version upgrade as a separate issue/todo list so we don't forget to update all of the fleet, as happened in #3672 (closed)
- Create a runbook for `docker-machine` rollout to help folks understand what needs to be done to upgrade the `docker-machine` version and that no downtime is required.
- Create a runbook for
- Create a set of E2E tests for each type of runner manager fleet, including platform-specific tests
- For example, for `gsrmX`: that a pipeline from a fork can run successfully
- For example for
- Tag each runner or set of runners with stages, so we can do a rolling deployment.
- When a complex update is needed, do a rolling deployment for each runner manager type, for example (a hedged Chef sketch of such a staged rollout follows this list):
  - `gsrm3` - first to deploy, let it bake for 1-2 hours
  - `gsrm4-5` - deploy to this fleet, let it bake for 30-40 minutes
  - Roll out to 100% of the fleet.
- Start using https://about.gitlab.com/handbook/engineering/infrastructure/change-management/ for related changes. Make it clear what the progress is and what the rollback strategy is, so that if problems arise the on-call SRE can easily roll back without development team assistance.
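A hedged sketch of what such a staged rollout could look like with the existing Chef tooling; the node queries, role name, and bake times are placeholder assumptions for illustration, not our actual inventory or runbook:

```shell
# Stage 1: converge only the canary manager (gsrm3 in the shorthand above),
# then bake for 1-2 hours while watching job failure rates.
knife ssh 'name:gsrm3*' 'sudo chef-client'

# Stage 2: converge the next slice (gsrm4 and gsrm5), bake for 30-40 minutes.
knife ssh 'name:gsrm4* OR name:gsrm5*' 'sudo chef-client'

# Stage 3: only after both bakes look healthy, converge the rest of the fleet.
knife ssh 'roles:gitlab-shared-runners-manager' 'sudo chef-client'
```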
- If the Situation Zoom room was utilised, the recording will be automatically uploaded to the Incident room Google Drive folder (private)