Investigate gitlab-kas image backoff error during hot patching production practice
Context
During the hot patching production practice for the 16.7 monthly release (summary in monthly release issue comment), the gitlab-kas deployment was stuck in ImagePullBackOff due to:
Failed to pull image "us-east1-docker.pkg.dev/gitlab-com-artifact-registry/images/gitlab-kas:16-7-202312081207-cc189b206a5-PATCHED-20231208193313": rpc error: code = NotFound desc = failed to pull and unpack image "us-east1-docker.pkg.dev/gitlab-com-artifact-registry/images/gitlab-kas:16-7-202312081207-cc189b206a5-PATCHED-20231208193313": failed to resolve reference "us-east1-docker.pkg.dev/gitlab-com-artifact-registry/images/gitlab-kas:16-7-202312081207-cc189b206a5-PATCHED-20231208193313": us-east1-docker.pkg.dev/gitlab-com-artifact-registry/images/gitlab-kas:16-7-202312081207-cc189b206a5-PATCHED-20231208193313: not found
The deployment job timed out, and Helm automatically rolled the release back. We cancelled the practice to investigate further in this issue.
For further context, the last successful hot patching production practice also referenced a similar image with -PATCHED- in its tag: https://ops.gitlab.net/gitlab-com/gl-infra/k8s-workloads/gitlab-com/-/jobs/11471752#L214
We should investigate whether there is an issue with uploading the image. I came across a potentially related commit in the auto-deploy-image-check folder of k8s-workloads/gitlab-com: gitlab-com/gl-infra/k8s-workloads/gitlab-com@ae585446. Since the files in that folder are supposed to be used to "compare during auto-deploy pipeline runs as a safety check to ensure that we are only updating images and not other configuration" (source), the image created/uploaded by patcher might be failing the check.