Incident Review: Image pulling from ops registry failing
Incident Review
The DRI for the incident review is the issue assignee.
- If applicable, ensure that the exec summary is completed at the top of the associated incident issue, the timeline tab is updated, and relevant graphs are included.
- If there are any corrective actions or infradev issues, ensure they are added as related issues to the original incident.
- Fill out relevant sections below or link to the meeting review notes that cover these topics.
Customer Impact
- Who was impacted by this incident? (i.e. external customers, internal customers): Only internal customers were impacted
- What was the customer experience during the incident? (i.e. preventing them from doing X, incorrect display of Y, ...) Image pulls from the ops registry were failing due to missing data
- How many customers were affected? Only internal customers were impacted
- If a precise customer impact number is unknown, what is the estimated impact (number and ratio of failed requests, amount of traffic drop, ...)? N/A
What were the root causes?
- On 2023-10-18, https://ops.gitlab.net/gitlab-com/gl-infra/terraform-modules/google/storage-buckets/-/merge_requests/112 was merged to start deleting registry data under the `docker/registry/v2/` path.
- On 2023-10-24 at 04:17, https://ops.gitlab.net/gitlab-com/gl-infra/config-mgmt/-/merge_requests/6990 was merged to bump the version across multiple environments to 11.2.1.
- Because our ops environment still uses the `docker/registry/v2/` path, we started deleting objects in the ops registry, which caused missing-digest errors on image pull; a minimal sketch of this kind of lifecycle rule check follows this list.
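For illustration only, here is a minimal sketch of how a delete lifecycle rule covering the registry prefix could be spotted with the google-cloud-storage Python client. The bucket name is hypothetical, the rule is assumed to match on the object prefix, and this is not the code from the Terraform module:

```python
# Minimal sketch, not the actual module code: look for GCS lifecycle rules that
# would delete live objects under the registry's docker/registry/v2/ path.
from google.cloud import storage

REGISTRY_PREFIX = "docker/registry/v2/"   # path still used by the ops registry
BUCKET_NAME = "ops-registry-bucket"       # hypothetical bucket name

client = storage.Client()
bucket = client.get_bucket(BUCKET_NAME)

for rule in bucket.lifecycle_rules:
    # Rules are returned as dict-like objects with "action" and "condition" keys.
    action = rule.get("action", {})
    condition = rule.get("condition", {})
    if action.get("type") == "Delete" and REGISTRY_PREFIX in condition.get("matchesPrefix", []):
        print("Delete rule covers registry data:", dict(rule))
```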
Incident Response Analysis
- How was the incident detected?
  - Incident was declared by @rpereira2 when we noticed image pull failures. We did not detect this through any automated monitoring.
- How could detection time be improved?
  - I'm not sure detection could be improved other than with continuous blackbox monitoring of image pulls; a minimal probe sketch is included after this list.
- How was the root cause diagnosed?
  - Once we saw that there was data loss, an SRE noticed the lifecycle rule. This was discovered at 11:23 UTC, approximately one hour after the incident was declared.
- How could time to diagnosis be improved?
  - Given that the incident occurred many hours after the Terraform change was applied, a clearer list of changes applied in the last 24 hours might have helped. The change was buried in a module version bump, so surfacing the changes contained in module updates could have sped up diagnosis.
- How did we reach the point where we knew how to mitigate the impact?
  - The main mitigation was to restore the missing images from the GitLab.com registry, which was completed around 13:00 UTC.
- How could time to mitigation be improved?
  - There was some confusion around object versioning: how to restore versions and how to identify objects that had been deleted. In the end we decided not to take that approach, but it was a significant distraction during the recovery; a sketch of the version-restore approach we considered is included after this list.
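As referenced above under detection, here is a minimal blackbox-probe sketch for image pulls. The registry host, repository, and tag are hypothetical and authentication is omitted; the probe fetches a canary manifest and then verifies that every blob it references still exists, which is where a lifecycle deletion would show up as a 404:

```python
# Minimal blackbox probe sketch: pull a canary image manifest and verify each
# referenced blob still exists. Host, repository, and tag are hypothetical.
import sys

import requests

REGISTRY = "https://registry.ops.example.internal"   # hypothetical ops registry host
REPO = "infra/canary"                                 # hypothetical canary image
TAG = "latest"
MANIFEST_TYPE = "application/vnd.docker.distribution.manifest.v2+json"


def probe() -> bool:
    resp = requests.get(
        f"{REGISTRY}/v2/{REPO}/manifests/{TAG}",
        headers={"Accept": MANIFEST_TYPE},
        timeout=10,
    )
    if resp.status_code != 200:
        print(f"manifest fetch failed: {resp.status_code}", file=sys.stderr)
        return False
    # A lifecycle deletion of layer data surfaces here: the manifest may still
    # resolve while the blobs it references are gone.
    for layer in resp.json().get("layers", []):
        head = requests.head(f"{REGISTRY}/v2/{REPO}/blobs/{layer['digest']}", timeout=10)
        if head.status_code != 200:
            print(f"missing blob {layer['digest']}: {head.status_code}", file=sys.stderr)
            return False
    return True


if __name__ == "__main__":
    sys.exit(0 if probe() else 1)
```

Run on a schedule and wired into alerting, a probe like this could flag missing digests without waiting for someone to notice a failed pull.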
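For context on the mitigation discussion, here is a minimal sketch of the version-restore approach that was considered but not taken. The bucket name is hypothetical and it assumes object versioning was enabled on the bucket; it lists noncurrent generations under the registry prefix and copies the newest deleted generation of each object back to its live name:

```python
# Minimal sketch of the version-restore approach (not what was ultimately done):
# find deleted generations under the registry prefix and copy them back.
from google.cloud import storage

BUCKET_NAME = "ops-registry-bucket"   # hypothetical bucket name
PREFIX = "docker/registry/v2/"

client = storage.Client()
bucket = client.bucket(BUCKET_NAME)

live_names = set()
deleted = {}
# versions=True also returns noncurrent (deleted or overwritten) generations.
for blob in client.list_blobs(BUCKET_NAME, prefix=PREFIX, versions=True):
    if blob.time_deleted is None:
        live_names.add(blob.name)  # this generation is still live
    elif blob.name not in deleted or blob.generation > deleted[blob.name].generation:
        deleted[blob.name] = blob  # keep only the newest deleted generation

for name, blob in deleted.items():
    if name in live_names:
        continue  # an object with this name still exists; don't overwrite it
    # Copying a specific generation back onto the same name recreates a live object.
    bucket.copy_blob(blob, bucket, new_name=name, source_generation=blob.generation)
    print(f"restored {name} from generation {blob.generation}")
```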
Post Incident Analysis
- Did we have other events in the past with the same root cause? No
- Do we have existing backlog items that would've prevented or greatly reduced the impact of this incident? No
- Was this incident triggered by a change (deployment of code or change to infrastructure)? If yes, link the issue. Yes, https://gitlab.com/gitlab-org/gitlab/-/issues/378289#note_1605989340
What went well?
- Collaboration on the incident call and working through possible fixes.