Corrective actions should be put here as soon as an incident is mitigated. Ensure that all corrective actions mentioned in the notes below are included.
...
Note:
In some cases we need to redact information from public view. We only do this in a limited number of documented cases, laid out in our handbook page. This might include the summary, timeline, or any other bits of information. Any of this confidential data will be in a linked issue, only visible internally.
By default, all information we can share will be public, in accordance with our transparency value.
Incident Review
Summary
Service(s) affected:
Team attribution:
Time to detection:
Minutes downtime or degradation:
Metrics
Customer Impact
Who was impacted by this incident? (e.g. external customers, internal customers)
...
What was the customer experience during the incident? (e.g. preventing them from doing X, incorrect display of Y, ...)
...
How many customers were affected?
...
If a precise customer impact number is unknown, what is the estimated impact (number and ratio of failed requests, amount of traffic drop, ...)?
...
What were the root causes?
...
Incident Response Analysis
How was the incident detected?
...
How could detection time be improved?
...
How was the root cause diagnosed?
...
How could time to diagnosis be improved?
...
How did we reach the point where we knew how to mitigate the impact?
...
How could time to mitigation be improved?
...
What went well?
...
Post Incident Analysis
Did we have other events in the past with the same root cause?
...
Do we have existing backlog items that would've prevented or greatly reduced the impact of this incident?
...
Was this incident triggered by a change (deployment of code or change to infrastructure)? If yes, link the issue.
From the incident Slack channel - we'll need to prepare a new base image for the runners using GCOS and then roll it out. @tmaczukin will get started on this at the start of next week.
steve@private-runners-manager-3.gitlab.com:~$ sudo -H docker-machine create --driver=google --google-project=gitlab-ci-155816 --google-machine-type=n2d-standard-2 --google-username=cos --google-use-internal-ip --engine-registry-mirror=https://mirror.gcr.io --google-zone=us-east1-d --google-machine-image=gitlab-ci-155816/global/images/runners-cos-stable-swtich-to-google-cos --google-metadata=cos-update-strategy=update_disabled --google-metadata-from-file=user-data=/etc/gitlab-runner/cloud-config.conf --google-disk-size=50 steveazz-new-image-qa
Running pre-create checks...
(steveazz-new-image-qa) Check that the project exists
(steveazz-new-image-qa) Check if the instance already exists
Creating machine...
(steveazz-new-image-qa) Generating SSH Key
(steveazz-new-image-qa) Creating host...
(steveazz-new-image-qa) Opening firewall ports
(steveazz-new-image-qa) Creating instance
(steveazz-new-image-qa) Waiting for Instance
(steveazz-new-image-qa) Uploading SSH Key
Waiting for machine to be running, this may take a few minutes...
Detecting operating system of created instance...
Waiting for SSH to be available...
Detecting the provisioner...
Provisioning with cos...
Copying certs to the local machine directory...
Copying certs to the remote machine...
Setting Docker configuration on the remote daemon...
Checking connection to Docker...
Docker is up and running!
To see how to connect your Docker Client to the Docker Engine running on this virtual machine, run: docker-machine env steveazz-new-image-qa

steve@private-runners-manager-3.gitlab.com:~$ sudo -H docker-machine ssh steveazz-new-image-qa
cos@steveazz-new-image-qa ~ $ cat /etc/os-release
NAME="Container-Optimized OS"
ID=cos
PRETTY_NAME="Container-Optimized OS from Google"
HOME_URL="https://cloud.google.com/container-optimized-os/docs"
BUG_REPORT_URL="https://cloud.google.com/container-optimized-os/docs/resources/support-policy#contact_us"
KERNEL_COMMIT_ID=40d34c852fd64d5472b62a4e98147b2cde2df869
GOOGLE_CRASH_ID=Lakitu
GOOGLE_METRICS_PRODUCT_ID=26
VERSION=85
VERSION_ID=85
BUILD_ID=13310.1260.26

cos@steveazz-new-image-qa ~ $ docker run --rm -it ruby:3.0.2-alpine3.14
Unable to find image 'ruby:3.0.2-alpine3.14' locally
3.0.2-alpine3.14: Pulling from library/ruby
5843afab3874: Already exists
7a470944fd7c: Pull complete
6b41d833dfb2: Pull complete
43ea7a74f14b: Pull complete
659841b36c13: Pull complete
Digest: sha256:c0dc1d12a3cd3bc7826452cc185570f996467ccca90253481bc35f213d2219b8
Status: Downloaded newer image for ruby:3.0.2-alpine3.14
irb(main):001:0> quit
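The manual `cat /etc/os-release` check above could be scripted during a wider rollout. The following is only a minimal sketch, assuming the os-release text has already been captured (for example via `docker-machine ssh`); the function names `parse_os_release` and `is_container_optimized_os` are hypothetical and not part of any existing runner tooling, and `SAMPLE` is a subset of the output shown above.

```python
def parse_os_release(text: str) -> dict[str, str]:
    """Parse os-release style KEY=VALUE lines into a dict, stripping quotes."""
    fields: dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines, comments, and anything that is not KEY=VALUE.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        fields[key] = value.strip().strip('"')
    return fields


def is_container_optimized_os(text: str) -> bool:
    """Hypothetical helper: True when the os-release ID field is 'cos'."""
    return parse_os_release(text).get("ID") == "cos"


# Subset of the os-release output captured on steveazz-new-image-qa.
SAMPLE = """NAME="Container-Optimized OS"
ID=cos
VERSION_ID=85
BUILD_ID=13310.1260.26
"""

if __name__ == "__main__":
    info = parse_os_release(SAMPLE)
    print(info["NAME"], info["BUILD_ID"], is_container_optimized_os(SAMPLE))
```

Checking the `ID` field rather than `NAME` keeps the comparison independent of the human-readable label, which may change between image releases.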
@steveazz - I'm trying to understand the impact of this incident. Was it only for gitlab-org runners or did this affect the entire Shared Runner fleet?
@kencjohnston gitlab-org was the most affected fleet, since we ran into issues like gitlab-org/gitlab#335641 (closed). However, the shared runner fleet also has the same problem; we just didn't have users trigger the problem that much.
Now I'm assuming you are asking because of the urgency of this. @tmaczukin and I are still working on a rollout plan for the new OS for the shared runner fleet. This is a really big change in OS for our shared runners, which is why we need to do our due diligence, since any misconfiguration can break a lot of workflows. We are planning to roll this new OS out slowly on the shared fleet to fully understand the problems associated with it.
Now I'm assuming you are asking because of the urgency of this
Actually, I wasn't. I was asking because it was the only incident with the Service::CI Runners tag in the last two weeks and was an S4, so I was trying to understand the level of impact. I was considering highlighting the fact that we've had two weeks without a customer-impacting incident related to Service::CI Runners. Would you say that statement is correct?