2021-07-09 - CI jobs using alpine 3.14 based images are failing

added IncidentActive, ServiceCI Runners, Source::IMAIncidentDeclare, incident, and severity4 labels

assigned to @igorwwwwwwwwwwwwwwwwwwww, @brentnewton, and @csouthard

changed the severity to Low - S4

Slack channel here.

✅ Production has no active severity1 / severity2 incidents or in-progress change issues.

✅ Production has no active incidents

✅ Production has no change requests in progress

🔮 GitLab Deployment Health Status - overview

✅ cny-api
✅ cny-git
✅ cny-web
✅ main-api
✅ main-git
✅ main-sidekiq
✅ main-web

✅ no active deployment

marked this issue as related to gitlab-org/gitlab#335641 (closed)

Manual QA

Using the image runners-cos-stable-swtich-to-google-cos, built from https://dev.gitlab.org/cookbooks/packer-runner-machines/-/merge_requests/40, we can see that it works now:

steve@private-runners-manager-3.gitlab.com:~$ sudo -H docker-machine create --driver=google --google-project=gitlab-ci-155816 --google-machine-type=n2d-standard-2 --google-username=cos --google-use-internal-ip --engine-registry-mirror=https://mirror.gcr.io --google-zone=us-east1-d --google-machine-image=gitlab-ci-155816/global/images/runners-cos-stable-swtich-to-google-cos --google-metadata=cos-update-strategy=update_disabled --google-metadata-from-file=user-data=/etc/gitlab-runner/cloud-config.conf --google-disk-size=50 steveazz-new-image-qa
Running pre-create checks...
(steveazz-new-image-qa) Check that the project exists
(steveazz-new-image-qa) Check if the instance already exists
Creating machine...
(steveazz-new-image-qa) Generating SSH Key
(steveazz-new-image-qa) Creating host...
(steveazz-new-image-qa) Opening firewall ports
(steveazz-new-image-qa) Creating instance
(steveazz-new-image-qa) Waiting for Instance
(steveazz-new-image-qa) Uploading SSH Key
Waiting for machine to be running, this may take a few minutes...
Detecting operating system of created instance...
Waiting for SSH to be available...
Detecting the provisioner...
Provisioning with cos...
Copying certs to the local machine directory...
Copying certs to the remote machine...
Setting Docker configuration on the remote daemon...
Checking connection to Docker...
Docker is up and running!
To see how to connect your Docker Client to the Docker Engine running on this virtual machine, run: docker-machine env steveazz-new-image-qa

steve@private-runners-manager-3.gitlab.com:~$ sudo -H docker-machine ssh steveazz-new-image-qa
cos@steveazz-new-image-qa ~ $ cat /etc/os-release
NAME="Container-Optimized OS"
ID=cos
PRETTY_NAME="Container-Optimized OS from Google"
HOME_URL="https://cloud.google.com/container-optimized-os/docs"
BUG_REPORT_URL="https://cloud.google.com/container-optimized-os/docs/resources/support-policy#contact_us"
KERNEL_COMMIT_ID=40d34c852fd64d5472b62a4e98147b2cde2df869
GOOGLE_CRASH_ID=Lakitu
GOOGLE_METRICS_PRODUCT_ID=26
VERSION=85
VERSION_ID=85
BUILD_ID=13310.1260.26

cos@steveazz-new-image-qa ~ $ docker run --rm -it ruby:3.0.2-alpine3.14
Unable to find image 'ruby:3.0.2-alpine3.14' locally
3.0.2-alpine3.14: Pulling from library/ruby
5843afab3874: Already exists
7a470944fd7c: Pull complete
6b41d833dfb2: Pull complete
43ea7a74f14b: Pull complete
659841b36c13: Pull complete
Digest: sha256:c0dc1d12a3cd3bc7826452cc185570f996467ccca90253481bc35f213d2219b8
Status: Downloaded newer image for ruby:3.0.2-alpine3.14
irb(main):001:0> quit
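
The OS check in the session above could be scripted roughly as follows. This is a minimal sketch, not part of the actual QA procedure: the helper names os_release_id and is_cos are hypothetical, and it only assumes the os-release format shown in the log (an unquoted or quoted ID= line).

```shell
#!/bin/sh
# Sketch only: os_release_id and is_cos are hypothetical helper names.

# Print the ID field of an os-release style file.
# $1: path to the file (defaults to /etc/os-release).
os_release_id() {
    # Match only the line that starts with "ID=" (not VERSION_ID= etc.),
    # strip the key, then drop any surrounding double quotes.
    sed -n 's/^ID=//p' "${1:-/etc/os-release}" | tr -d '"'
}

# Succeed (exit status 0) only when the file identifies Container-Optimized OS.
is_cos() {
    [ "$(os_release_id "$1")" = "cos" ]
}
```

On the created machine, something like `is_cos && docker run --rm ruby:3.0.2-alpine3.14 ruby -v` would cover both manual steps in one line; `ruby -v` is substituted here for the interactive irb session in the log.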

Nice, looking forward to getting this rolled out.

mentioned in issue on-call-handovers#1837 (closed)

mentioned in issue on-call-handovers#1838 (closed)

mentioned in issue on-call-handovers#1839 (closed)

mentioned in issue on-call-handovers#1840 (closed)

mentioned in issue on-call-handovers#1841 (closed)

mentioned in issue on-call-handovers#1842 (closed)

mentioned in issue on-call-handovers#1843 (closed)

@steveazz - I'm trying to understand the impact of this incident. Was it only for gitlab-org runners or did this affect the entire Shared Runner fleet?

@kencjohnston gitlab-org was the most affected fleet, since we ran into issues like gitlab-org/gitlab#335641 (closed). However, the shared runner fleet also has the same problem; we just didn't have users trigger it that much.

Now I'm assuming you are asking because of the urgency of this. @tmaczukin and I are still working on a rollout plan for the new OS for the shared runner fleet. This is a really big change in OS for our shared runners, which is why we need to do our due diligence: any misconfiguration can break a lot of workflows. We are planning to roll out the new OS slowly on the shared fleet to fully understand the problems associated with it.

We are also seeing some DNS problems in https://gitlab.com/gitlab-org/gitlab/-/issues/335871

@steveazz

Now I'm assuming you are asking because of the urgency of this

Actually, I wasn't. I was asking because it was the only Incident with the ServiceCI Runners tag in the last two weeks and it was an S4, so I was trying to understand the level of impact. I was considering highlighting the fact that we've had two weeks without a customer-impacting Incident related to ServiceCI Runners. Would you say that statement is correct?

@kencjohnston that is correct, yes! ServiceCI Runners has been pretty stable for the past 2 weeks 🚀

mentioned in issue on-call-handovers#1844 (closed)

mentioned in issue on-call-handovers#1845 (closed)

mentioned in issue on-call-handovers#1846 (closed)

mentioned in issue on-call-handovers#1847 (closed)

mentioned in issue on-call-handovers#1848 (closed)

mentioned in issue on-call-handovers#1849 (closed)

mentioned in issue on-call-handovers#1850 (closed)

mentioned in issue on-call-handovers#1851 (closed)

mentioned in issue on-call-handovers#1852 (closed)

mentioned in issue on-call-handovers#1853 (closed)

mentioned in issue on-call-handovers#1854 (closed)

mentioned in issue on-call-handovers#1855 (closed)

mentioned in issue on-call-handovers#1856 (closed)

mentioned in issue on-call-handovers#1857 (closed)

mentioned in issue on-call-handovers#1858 (closed)

mentioned in issue on-call-handovers#1859 (closed)

mentioned in issue on-call-handovers#1860 (closed)

@tmaczukin @steveazz Is there anything to be done here still, or can we close it out?

@igorwwwwwwwwwwwwwwwwwwww we'll start the rollout for the shared fleet today, following the plan laid out in https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/330#note_630338890. We still have some work to do, since so far the fix has only been rolled out to the internal shards.

mentioned in issue on-call-handovers#1861 (closed)

mentioned in issue on-call-handovers#1862 (closed)

mentioned in issue on-call-handovers#1863 (closed)

Update 2021-07-21

Done

Verified that the new image we built works as expected 👉 #5131 (closed)
Rolled out the new image to 1 (20% of jobs), srm3 👉 #5184 (closed)

Update 2021-07-22

Done

Rolled out the new image to 60% of our shared shard 👉 #5184 (closed)

Update 2021-07-28

Done

No update from #5134 (comment 633159937) due to the hard PCL
