This does not directly have any customer impact, but it ended up causing #18717 (closed), which could have had a large impact; that was caught and mitigated before users could notice the latency.
Current Status
We are rolling forward with the new auto-deploy package since its image is tagged properly. We still aren't clear on what deleted the tag from the previous gitlab-base image.
Note: In some cases we need to redact information from public view. We only do this in a limited number of documented cases. This might include the summary, timeline or any other bits of information, as laid out in our handbook page. Any of this confidential data will be in a linked issue, only visible internally. By default, all information we can share will be public, in accordance with our transparency value.
Security Note: If anything abnormal is found during the course of your investigation, please do not hesitate to contact security.
We are having ImagePull failures across GPRD zonal and regional clusters as part of a release:
The pod events for the failing image are as follows:
| Type | Reason | Age | From | Message |
| --- | --- | --- | --- | --- |
| Normal | Scheduled | 25m | default-scheduler | Successfully assigned gitlab/gitlab-webservice-git-7c55b5f4b7-d6stm to gke-gprd-us-east1-b-generic-2-92c921ca-snfs |
| Normal | Pulling | 25m | kubelet | Pulling image "busybox:latest" |
| Normal | Pulled | 25m | kubelet | Successfully pulled image "busybox:latest" in 661ms (661ms including waiting). Image size: 2166802 bytes. |
| Normal | Created | 25m | kubelet | Created container write-instance-name |
| Normal | Started | 25m | kubelet | Started container write-instance-name |
| Normal | Pulled | 25m | kubelet | Container image "us-east1-docker.pkg.dev/gitlab-com-artifact-registry/images/certificates:17-5-202410151605-e651cfe6c38" already present on machine |
| Normal | Started | 25m | kubelet | Started container certificates |
| Normal | Created | 25m | kubelet | Created container certificates |
| Warning | Failed | 25m (x3 over 25m) | kubelet | Failed to pull image "us-east1-docker.pkg.dev/gitlab-com-artifact-registry/images/gitlab-base:17-5-202410151605-e651cfe6c38": rpc error: code = NotFound desc = failed to pull and unpack image "us-east1-docker.pkg.dev/gitlab-com-artifact-registry/images/gitlab-base:17-5-202410151605-e651cfe6c38": failed to resolve reference "us-east1-docker.pkg.dev/gitlab-com-artifact-registry/images/gitlab-base:17-5-202410151605-e651cfe6c38": us-east1-docker.pkg.dev/gitlab-com-artifact-registry/images/gitlab-base:17-5-202410151605-e651cfe6c38: not found |
| Warning | Failed | 25m (x3 over 25m) | kubelet | Error: ErrImagePull |
| Warning | Failed | 24m (x4 over 25m) | kubelet | Error: ImagePullBackOff |
| Normal | Pulling | 24m (x4 over 25m) | kubelet | Pulling image "us-east1-docker.pkg.dev/gitlab-com-artifact-registry/images/gitlab-base:17-5-202410151605-e651cfe6c38" |
| Normal | BackOff | 45s (x107 over 25m) | kubelet | Back-off pulling image "us-east1-docker.pkg.dev/gitlab-com-artifact-registry/images/gitlab-base:17-5-202410151605-e651cfe6c38" |
This tag did work successfully on staging (zonal cluster)
The rollout will fail shortly, but we cannot roll forward. Delivery has looked everywhere and we're not finding any good answers yet.
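To gauge how widespread the pull failures are on a given cluster, something like the following can enumerate the stuck pods. This is a minimal sketch using the Kubernetes Python client; the `gitlab` namespace and the kubeconfig context are assumptions, not the exact commands used during the incident.

```python
# Minimal sketch: list pods in the gitlab namespace whose containers are stuck
# waiting on image pulls, to gauge how widespread the failure is.
# Assumes a kubeconfig context pointing at the affected cluster; the
# namespace name is an assumption.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() when run in-cluster
v1 = client.CoreV1Api()

for pod in v1.list_namespaced_pod("gitlab").items:
    for status in (pod.status.container_statuses or []):
        waiting = status.state.waiting
        if waiting and waiting.reason in ("ImagePullBackOff", "ErrImagePull"):
            print(pod.metadata.name, status.image, waiting.reason)
```

Running this once per zonal/regional context would also show whether anything other than gitlab-base is affected.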
@ahmadsherif this is probably going to roll over into your shift too, but currently there is nothing we need to do as EOC; the release managers are deploying the new auto-deploy package, which seems to have been tagged properly.
I am raising the severity of this to :s2: since it ended up causing #18717 (closed), and the blast radius could have been larger: because of it, Mailroom and most of our Sidekiq workers started to suffer.
Thanks for taking part in this incident! It looks like this incident needs an async Incident Review issue. Please use the Incident Review link in the incident's description to create one.
We're posting this message because this issue meets the following criteria:
If you are certain that this incident doesn't require an incident review, add the
IncidentReviewNotNeeded label to this issue with a note explaining why.
We started to get paged from different components that pods were not in a running state.
We identified the problem as a missing image, and verified that the image didn't have the tag in the Artifact Registry.
We decided to go ahead with the deploy, as we thought it would eventually fail and Helm would roll back.
But we didn't notice that the image had actually been pushed and then somehow got deleted. As we can see, the same image had been running, but new pods failed to pull it after we scaled up: #18715 (comment 2160499413)
This ended up building up the Sidekiq queues and causing #18717 (closed)
We decided to manually tag the correct image with the expected tag as the fastest way to mitigate the issue (an illustrative sketch follows this timeline); ref: #18717 (comment 2160431983)
And since the new auto-deploy tagged the image correctly, we decided to roll forward with the new image.
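For reference, the manual re-tagging mentioned above boils down to pointing the expected tag back at the still-existing manifest digest. Below is a hedged sketch of doing that through the Docker Registry v2 API; the digest, repository, and token handling are placeholders, and the actual mitigation may have used different tooling (e.g. gcloud or crane).

```python
# Minimal sketch: point a tag at an existing manifest digest by re-pushing the
# same manifest bytes under the tag (Docker Registry v2 API). The digest and
# token are placeholders; this is illustrative, not the exact mitigation steps.
import requests

REGISTRY = "us-east1-docker.pkg.dev"
REPOSITORY = "gitlab-com-artifact-registry/images/gitlab-base"
DIGEST = "sha256:<manifest digest to re-tag>"   # placeholder
TAG = "17-5-202410151605-e651cfe6c38"
TOKEN = "<access token>"                        # e.g. a Google OAuth access token

AUTH = {"Authorization": f"Bearer {TOKEN}"}
ACCEPT = ", ".join([
    "application/vnd.docker.distribution.manifest.v2+json",
    "application/vnd.docker.distribution.manifest.list.v2+json",
    "application/vnd.oci.image.manifest.v1+json",
    "application/vnd.oci.image.index.v1+json",
])

# Fetch the manifest by digest (it still exists; only the tag is gone).
get = requests.get(
    f"https://{REGISTRY}/v2/{REPOSITORY}/manifests/{DIGEST}",
    headers={**AUTH, "Accept": ACCEPT},
)
get.raise_for_status()

# Push the identical bytes back under the tag; the registry simply adds the tag.
put = requests.put(
    f"https://{REGISTRY}/v2/{REPOSITORY}/manifests/{TAG}",
    headers={**AUTH, "Content-Type": get.headers["Content-Type"]},
    data=get.content,
)
put.raise_for_status()
print("tag now points at", put.headers.get("Docker-Content-Digest"))
```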
Above, gitlab-base:17-5-202410151605-e651cfe6c38 is mentioned, and elsewhere, seemingly, the pipeline that would have produced it. That pipeline certainly pushed this image, according to the final artifacts listing.
The tag is present on .org, though I will admit that the screenshot above does leave some confusion. The job which produced the tag shows `Image exists already and will not be built again.`, thus it determined there had been no functional changes to the image, or to the base image over which it was built (debian-stable records this). The existing image was then pulled, re-certified, and pushed back under the same tag.
This is the image manifest that was made and pushed. That is the manifest being reported as "missing", though under the tag. I see comments above regarding gitlab-base@sha256:3edb42f83cf462f679a32f2dee5d0a7c38c63efd012c3327d40a34ce3837c3dc, but I do not understand the cause of the misalignment.
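To make that split concrete: the manifest can still be fetched by digest while the tag returns NotFound. A minimal sketch that checks both references side by side; the token handling is illustrative, and the digest is the one quoted in the comments above.

```python
# Minimal sketch: a manifest can still exist under its digest while the tag no
# longer resolves. Checking both references side by side shows the split.
# Token handling is illustrative.
import requests

REGISTRY = "us-east1-docker.pkg.dev"
REPOSITORY = "gitlab-com-artifact-registry/images/gitlab-base"
TOKEN = "<access token>"
REFERENCES = [
    "17-5-202410151605-e651cfe6c38",  # the tag kubelet asked for
    "sha256:3edb42f83cf462f679a32f2dee5d0a7c38c63efd012c3327d40a34ce3837c3dc",  # digest from the comments
]

ACCEPT = ", ".join([
    "application/vnd.docker.distribution.manifest.v2+json",
    "application/vnd.docker.distribution.manifest.list.v2+json",
    "application/vnd.oci.image.manifest.v1+json",
    "application/vnd.oci.image.index.v1+json",
])

for ref in REFERENCES:
    resp = requests.head(
        f"https://{REGISTRY}/v2/{REPOSITORY}/manifests/{ref}",
        headers={"Authorization": f"Bearer {TOKEN}", "Accept": ACCEPT},
    )
    print(f"{ref}: HTTP {resp.status_code}")
```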
@WarheadsSE from what I can tell from the logs, the image was present in the Artifact Registry for a while, but the tag disappeared / was deleted without any audit trace.
I've just been having a look at this with @tkhandelwal3.
One odd thing we noticed when looking through the asset inventory for the affected image is that the manifest seems to have been deleted at approximately 2024-10-15T20:55:49Z and recreated at 2024-10-15T22:50:40.563551Z without any of the previously defined tags. The recreation could be cosign acting normally, recreating the manifest as it presumably didn't exist at that time.
We can't see any request to that resource in the logs for the project during that time frame, so it's unclear exactly what is happening.
I would hazard a guess that this is either a bug in artifact registry, or some kind of clean up behaviour. Either way I think a support ticket to GCP is probably a good course of action.
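One way to double-check the "recreated without any of the previously defined tags" observation from outside the asset inventory is a reverse lookup against the registry itself: list the repository's tags and see which, if any, currently resolve to the recreated manifest's digest. A minimal sketch; pagination of `/tags/list` and token handling are simplified, and the digest is a placeholder.

```python
# Minimal sketch: enumerate the repository's tags and report which of them (if
# any) currently resolve to a given manifest digest. Pagination of /tags/list
# is ignored for brevity; the digest and token are placeholders.
import requests

REGISTRY = "us-east1-docker.pkg.dev"
REPOSITORY = "gitlab-com-artifact-registry/images/gitlab-base"
DIGEST = "sha256:<digest of the recreated manifest>"  # placeholder
TOKEN = "<access token>"

AUTH = {"Authorization": f"Bearer {TOKEN}"}
ACCEPT = ", ".join([
    "application/vnd.docker.distribution.manifest.v2+json",
    "application/vnd.docker.distribution.manifest.list.v2+json",
    "application/vnd.oci.image.manifest.v1+json",
    "application/vnd.oci.image.index.v1+json",
])

tags = requests.get(
    f"https://{REGISTRY}/v2/{REPOSITORY}/tags/list", headers=AUTH
).json().get("tags") or []

matching = []
for tag in tags:
    resp = requests.head(
        f"https://{REGISTRY}/v2/{REPOSITORY}/manifests/{tag}",
        headers={**AUTH, "Accept": ACCEPT},
    )
    if resp.headers.get("Docker-Content-Digest") == DIGEST:
        matching.append(tag)

print("tags pointing at the manifest:", matching or "none")
```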
This error stands out: `invalid manifest list/index reference(s), please report this issue to GitLab at https://gitlab.com/gitlab-org/container-registry/-/issues/409`
@rehab the times are all UTC. It's quite confusing in the UI, as some timestamps are shown in local time, but it appears the change times are all in UTC. I'll continue using UTC for this comment.
The actual image doesn't look like it has changed in quite a while, which is why we're seeing tags added to the existing image manifest. For example, the reference you mention was added at 2024-10-15T16:49:28.382575Z and was only removed at 2024-10-15T20:55:49Z.
I do wonder if there is a process that deleted the image manifest after the invalid manifest failure you mention.
Having a look further back in the history for that image, it was created for 17-3-stable and only had tags added to it until 2024-10-07T17:13:45.079855Z, when some tags were removed.
Given the timing of this job and its name, it looks very suspect, but unfortunately the job logs are truncated, so I can't see exactly what happened in this case.
I see this pipeline for the auto-deploy - 17-5-202410151605 - which started at nearly the same time as this pipeline for buildx-test3-0. The sync job did not run for buildx-test3-0 as the pipeline failed, but it did complete the build gitlab-base job. Did that upload the gitlab-base image (to dev.gitlab.org only) with no tag, or only one tag? Would that have somehow tripped up the cosign / sync job from the auto-deploy?
Though, having a look at the cosign codebase for the copy operation, it looks like if a tag can't be found then it should just be skipped. I could only see this happening if we could successfully push and pull an empty manifest to the dev GitLab container registry.
The thing that I'm really interested in knowing, but can't find a way to see, is the actor that caused the deletion. It's not captured in the Google audit logs for some reason.
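For completeness, this is roughly the query shape that could surface the deleting actor from Cloud Audit Logs, if the relevant Data Access logs are enabled for Artifact Registry (which may be exactly the gap here). A sketch only; the project id, time window, and filter are assumptions.

```python
# Minimal sketch: search Cloud Audit Logs for operations touching gitlab-base
# around the deletion window, hoping to catch the deleting actor. Assumes Data
# Access audit logs are enabled for Artifact Registry in this project; the
# project id, filter, and payload shape are assumptions.
from google.cloud import logging

client = logging.Client(project="gitlab-com-artifact-registry")
log_filter = (
    'logName:"cloudaudit.googleapis.com" '
    'AND protoPayload.resourceName:"gitlab-base" '
    'AND timestamp>="2024-10-15T20:00:00Z" AND timestamp<="2024-10-15T23:00:00Z"'
)

for entry in client.list_entries(filter_=log_filter, order_by=logging.DESCENDING):
    payload = entry.payload if isinstance(entry.payload, dict) else {}
    print(
        entry.timestamp,
        payload.get("methodName"),
        payload.get("authenticationInfo", {}).get("principalEmail"),
        payload.get("resourceName"),
    )
```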
@dawsmith Since `push_tags` was called: yes, an image tag was pushed to Dev after being run through cosign.
There should not have been a "removal" of any tag. The linked auto-deploy pipeline started, and finished, long before the second pipeline's job started.
Note that both of the linked jobs refer to identical container version expectations.
`Image exists already and will not be built again.`
`Image exists - dev.gitlab.org:5005/gitlab/charts/components/images/gitlab-base:831cbae24236df6791f209c7e5c4f6b5e231183d@sha256:d4a2d2b8a0f594ec856143aa91ead6be49f3254e3d891088fc400123c0e0abd7`
As to 17-3-stable, here is the latest pipeline from that branch, and the gitlab-base job. This job ran ~2.5 hours later than the previous two. That pipeline determined gitlab-base's CONTAINER_VERSION was to be 1fbd1925e053719a2a0c163add7ac0dac166990f. That does not align with 831cbae24236df6791f209c7e5c4f6b5e231183d.
This incident was automatically closed because it has the IncidentResolved label.
Note: All incidents are closed automatically when they are resolved, even when there is a pending review. Please see the Incident Workflow section on the Incident Management handbook page for more information.
This issue has IncidentReviewNeeded set.
Please either fix this by adding one of the corresponding labels, or add a ::NotNeeded scoped label with an explanation if you are sure that it is not needed. Adding the ::NotNeeded scoped label will prevent these notifications; otherwise this notice will repeat in 7 days.