Incident Review: Image pulling from ops registry failing
Incident Review
The DRI for the incident review is the issue assignee.
- If applicable, ensure that the exec summary is completed at the top of the associated incident issue, the timeline tab is updated, and relevant graphs are included.
- If there are any corrective actions or infradev issues, ensure they are added as related issues to the original incident.
- Fill out relevant sections below or link to the meeting review notes that cover these topics.
Customer Impact
- Who was impacted by this incident? (i.e. external customers, internal customers): Only internal customers were impacted
- What was the customer experience during the incident? (i.e. preventing them from doing X, incorrect display of Y, ...) Image pulls from the ops registry were failing due to missing data
- How many customers were affected? Only internal customers were impacted
- If a precise customer impact number is unknown, what is the estimated impact (number and ratio of failed requests, amount of traffic drop, ...)? N/A
What were the root causes?
- On 2023-10-18, https://ops.gitlab.net/gitlab-com/gl-infra/terraform-modules/google/storage-buckets/-/merge_requests/112 was merged to start deleting registry data under the `docker/registry/v2/` path.
- On 2023-10-24 at 04:17, https://ops.gitlab.net/gitlab-com/gl-infra/config-mgmt/-/merge_requests/6990 was merged to bump the version across multiple environments to 11.2.1.
- Because our ops environment still uses the `docker/registry/v2/` path, we started deleting objects in the ops registry, which caused missing-digest errors on image pull; a minimal sketch of this kind of lifecycle rule check follows this list.
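For illustration only, here is a minimal sketch of how a delete lifecycle rule covering the registry prefix could be spotted with the google-cloud-storage Python client. The bucket name is hypothetical, the rule is assumed to match on the object prefix, and this is not the code from the Terraform module:

```python
# Minimal sketch, not the actual module code: look for GCS lifecycle rules that
# would delete live objects under the registry's docker/registry/v2/ path.
from google.cloud import storage

REGISTRY_PREFIX = "docker/registry/v2/"   # path still used by the ops registry
BUCKET_NAME = "ops-registry-bucket"       # hypothetical bucket name

client = storage.Client()
bucket = client.get_bucket(BUCKET_NAME)

for rule in bucket.lifecycle_rules:
    # Rules are returned as dict-like objects with "action" and "condition" keys.
    action = rule.get("action", {})
    condition = rule.get("condition", {})
    if action.get("type") == "Delete" and REGISTRY_PREFIX in condition.get("matchesPrefix", []):
        print("Delete rule covers registry data:", dict(rule))
```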
Incident Response Analysis
- How was the incident detected?
  - Incident was declared by @rpereira2 when we noticed image pull failures. We did not detect this through any automated monitoring.
- How could detection time be improved?
  - I'm not sure detection could be improved other than with continuous blackbox monitoring of image pulls; a minimal probe sketch is included after this list.
- How was the root cause diagnosed?
  - Once we saw that there was data loss, an SRE noticed the lifecycle rule. This was discovered at 11:23 UTC, approximately one hour after the incident was declared.
- How could time to diagnosis be improved?
  - Given that the incident occurred many hours after the Terraform change was applied, a clearer list of changes applied in the last 24 hours might have helped. The change was buried in a module version bump, so surfacing the changes contained in module updates could have sped up diagnosis.
- How did we reach the point where we knew how to mitigate the impact?
  - The main mitigation was to restore the missing images from the GitLab.com registry, which was completed around 13:00 UTC.
- How could time to mitigation be improved?
  - There was some confusion around object versioning: how to restore versions and how to identify objects that had been deleted. In the end we decided not to take that approach, but it was a significant distraction during the recovery; a sketch of the version-restore approach we considered is included after this list.
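As referenced above under detection, here is a minimal blackbox-probe sketch for image pulls. The registry host, repository, and tag are hypothetical and authentication is omitted; the probe fetches a canary manifest and then verifies that every blob it references still exists, which is where a lifecycle deletion would show up as a 404:

```python
# Minimal blackbox probe sketch: pull a canary image manifest and verify each
# referenced blob still exists. Host, repository, and tag are hypothetical.
import sys

import requests

REGISTRY = "https://registry.ops.example.internal"   # hypothetical ops registry host
REPO = "infra/canary"                                 # hypothetical canary image
TAG = "latest"
MANIFEST_TYPE = "application/vnd.docker.distribution.manifest.v2+json"


def probe() -> bool:
    resp = requests.get(
        f"{REGISTRY}/v2/{REPO}/manifests/{TAG}",
        headers={"Accept": MANIFEST_TYPE},
        timeout=10,
    )
    if resp.status_code != 200:
        print(f"manifest fetch failed: {resp.status_code}", file=sys.stderr)
        return False
    # A lifecycle deletion of layer data surfaces here: the manifest may still
    # resolve while the blobs it references are gone.
    for layer in resp.json().get("layers", []):
        head = requests.head(f"{REGISTRY}/v2/{REPO}/blobs/{layer['digest']}", timeout=10)
        if head.status_code != 200:
            print(f"missing blob {layer['digest']}: {head.status_code}", file=sys.stderr)
            return False
    return True


if __name__ == "__main__":
    sys.exit(0 if probe() else 1)
```

Run on a schedule and wired into alerting, a probe like this could flag missing digests without waiting for someone to notice a failed pull.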
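For context on the mitigation discussion, here is a minimal sketch of the version-restore approach that was considered but not taken. The bucket name is hypothetical and it assumes object versioning was enabled on the bucket; it lists noncurrent generations under the registry prefix and copies the newest deleted generation of each object back to its live name:

```python
# Minimal sketch of the version-restore approach (not what was ultimately done):
# find deleted generations under the registry prefix and copy them back.
from google.cloud import storage

BUCKET_NAME = "ops-registry-bucket"   # hypothetical bucket name
PREFIX = "docker/registry/v2/"

client = storage.Client()
bucket = client.bucket(BUCKET_NAME)

live_names = set()
deleted = {}
# versions=True also returns noncurrent (deleted or overwritten) generations.
for blob in client.list_blobs(BUCKET_NAME, prefix=PREFIX, versions=True):
    if blob.time_deleted is None:
        live_names.add(blob.name)  # this generation is still live
    elif blob.name not in deleted or blob.generation > deleted[blob.name].generation:
        deleted[blob.name] = blob  # keep only the newest deleted generation

for name, blob in deleted.items():
    if name in live_names:
        continue  # an object with this name still exists; don't overwrite it
    # Copying a specific generation back onto the same name recreates a live object.
    bucket.copy_blob(blob, bucket, new_name=name, source_generation=blob.generation)
    print(f"restored {name} from generation {blob.generation}")
```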
Post Incident Analysis
- Did we have other events in the past with the same root cause? No
- Do we have existing backlog items that would've prevented or greatly reduced the impact of this incident? No
- Was this incident triggered by a change (deployment of code or change to infrastructure)? If yes, link the issue. Yes, https://gitlab.com/gitlab-org/gitlab/-/issues/378289#note_1605989340
What went well?
- Collaboration on the incident call and working through possible fixes.