2023-10-24: Image pulling from ops registry failing
Customer Impact
No known customer-facing impact.
Current Status
Root Cause
Before https://ops.gitlab.net/gitlab-com/gl-infra/terraform-modules/google/storage-buckets/-/commit/b1fdb5de3f58724692de9c85e3d912d4fe49f55b, we had a lifecycle policy that marked all objects in docker/registry/v2/blobs/ as ARCHIVED. That commit updated the lifecycle policy to delete all objects that are ARCHIVED.
This was safe to execute on gprd and gstg because the registry no longer uses that path there, but this wasn't the case for the ops environment, and the change was applied to all of the environments.
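The interaction between the two rules can be illustrated with a small Python model. This is only a sketch, not the actual Terraform module configuration: the rule dictionaries borrow the field names of the GCS lifecycle JSON format (`matchesPrefix`, `matchesStorageClass`, `SetStorageClass`, `Delete`), the storage class is written as `ARCHIVE`, any age conditions on the real rules are left out, and the object names are hypothetical.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class StoredObject:
    name: str
    storage_class: str


# Earlier policy (simplified): move everything under the blobs prefix to ARCHIVE.
ARCHIVE_BLOBS_RULE = {
    "action": {"type": "SetStorageClass", "storageClass": "ARCHIVE"},
    "condition": {"matchesPrefix": ["docker/registry/v2/blobs/"]},
}

# Updated policy (simplified): delete every object whose class is ARCHIVE.
DELETE_ARCHIVED_RULE = {
    "action": {"type": "Delete"},
    "condition": {"matchesStorageClass": ["ARCHIVE"]},
}


def rule_matches(rule, obj):
    """Return True when the object satisfies every condition present in the rule."""
    cond = rule["condition"]
    prefixes = cond.get("matchesPrefix")
    classes = cond.get("matchesStorageClass")
    prefix_ok = not prefixes or any(obj.name.startswith(p) for p in prefixes)
    class_ok = not classes or obj.storage_class in classes
    return prefix_ok and class_ok


def apply_rule(rule, objects):
    """Return the objects that remain after one pass of the rule."""
    remaining = []
    for obj in objects:
        if not rule_matches(rule, obj):
            remaining.append(obj)
            continue
        action = rule["action"]["type"]
        if action == "SetStorageClass":
            remaining.append(replace(obj, storage_class=rule["action"]["storageClass"]))
        elif action != "Delete":
            raise ValueError(f"unsupported action: {action}")
        # For "Delete", the matching object is simply dropped.
    return remaining


# Hypothetical ops bucket where the registry still reads from the blobs path.
ops_objects = [
    StoredObject("docker/registry/v2/blobs/sha256/aa/example/data", "STANDARD"),
    StoredObject("docker/registry/v2/repositories/example/_manifests/link", "STANDARD"),
]

after_archive = apply_rule(ARCHIVE_BLOBS_RULE, ops_objects)
after_delete = apply_rule(DELETE_ARCHIVED_RULE, after_archive)
print([o.name for o in after_delete])
```

In this model the blob is removed while the repository metadata that points at it survives, which is consistent with pulls failing on ops while environments that no longer use the path are unaffected.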
Action Items
- @sxuereb / @jdrpereira: Recover the layers to unblock the deployment
- @jarv: Remove the lifecycle rule to stop deleting ARCHIVE objects
  - 👉 Terraform module update https://ops.gitlab.net/gitlab-com/gl-infra/terraform-modules/google/storage-buckets/-/merge_requests/114
  - 👉 Renovate pipeline https://ops.gitlab.net/gitlab-com/gl-infra/renovate/renovate-ci/-/pipelines/2444708
  - 👉 Version bump https://ops.gitlab.net/gitlab-com/gl-infra/config-mgmt/-/merge_requests/7021
- @sxuereb: Wait until Friday for a go/no-go decision on object recovery
  - Have sign-off from infrastructure leadership about this data loss 👉 #17020 (comment 1616771770)
- @ahanselka: Create a lifecycle policy to mark all objects back to standard for gs://gitlab-ops-registry/docker/registry/v2
  - Monitor that objects are getting moved to MULTI_REGIONAL
- @sxuereb: Follow up on introducing the lifecycle that was re-introduced https://gitlab.com/gitlab-org/gitlab/-/issues/378289
- @jarv: Create incident review issue for collecting corrective actions #17027 (closed)
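For the monitoring item above, one possible approach is to tally objects under the prefix by storage class and list the ones still archived. The sketch below assumes the object listing is obtained elsewhere as `(name, storage_class)` pairs (for example from a bucket listing, which is not shown here); the function names and sample data are hypothetical.

```python
from collections import Counter


def storage_class_counts(objects, prefix):
    """Count objects under `prefix` per storage class.

    `objects` is an iterable of (name, storage_class) pairs.
    """
    return Counter(sc for name, sc in objects if name.startswith(prefix))


def still_archived(objects, prefix, archived_class="ARCHIVE"):
    """Return the sorted names under `prefix` that are still in the archived class."""
    return sorted(
        name
        for name, sc in objects
        if name.startswith(prefix) and sc == archived_class
    )


# Hypothetical listing; real names and classes would come from the bucket.
sample_listing = [
    ("docker/registry/v2/blobs/sha256/aa/example/data", "MULTI_REGIONAL"),
    ("docker/registry/v2/blobs/sha256/bb/example/data", "ARCHIVE"),
    ("docker/registry/v2/repositories/example/_manifests/link", "MULTI_REGIONAL"),
    ("unrelated/path/file", "ARCHIVE"),
]

PREFIX = "docker/registry/v2"
print(storage_class_counts(sample_listing, PREFIX))
print(still_archived(sample_listing, PREFIX))
```

Running the tally periodically and watching the archived count trend toward zero would give a simple progress signal for the transition.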
Failing to pull image
- Report the problem in this incident
- Recover the image. You have a few options:
  - Rebuild the image
  - Check whether the image is present on https://gitlab.com/; if so, pull it and push it to https://ops.gitlab.net
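The pull-and-push option can be sketched as a small helper that builds the `docker pull`, `docker tag`, and `docker push` commands. The registry hostnames, image path, and tag below are assumptions for illustration only, not values taken from this incident, and the sketch presumes you are already logged in to both registries.

```python
import shlex


def mirror_commands(image_path, tag, source_registry, target_registry):
    """Build the docker commands to copy one image tag between registries."""
    source = f"{source_registry}/{image_path}:{tag}"
    target = f"{target_registry}/{image_path}:{tag}"
    return [
        ["docker", "pull", source],         # fetch the image from the source registry
        ["docker", "tag", source, target],  # re-tag it for the target registry
        ["docker", "push", target],         # upload it to the target registry
    ]


if __name__ == "__main__":
    # Hypothetical values; substitute the real registry hosts, path, and tag.
    commands = mirror_commands(
        image_path="example-group/example-image",
        tag="v1.0.0",
        source_registry="registry.gitlab.com",
        target_registry="registry.ops.gitlab.net",
    )
    for cmd in commands:
        print(shlex.join(cmd))
```

The helper only prints the commands so they can be reviewed before being run by hand.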
All known failures:
- https://ops.gitlab.net/gitlab-com/gl-infra/deployer/-/jobs/11619895
- https://ops.gitlab.net/gitlab-com/gl-infra/k8s-workloads/gitlab-com/-/jobs/11620614
- #17022 (closed)
- https://gitlab.com/gitlab-com/gl-infra/k8s-workloads/gitlab-helmfiles/-/jobs/5361298981
- https://ops.gitlab.net/gitlab-com/gl-infra/charts/-/jobs/11622905