This does not directly have any customer impact, but it ended up causing #18717 (closed), which could have had a large impact; that was caught and mitigated before users could notice the latency.
Current Status
We are rolling forward with the new auto-deploy package since its image is tagged properly. We still aren't clear on what deleted the tag from the previous gitlab-base image.
Note: In some cases we need to redact information from public view. We only do this in a limited number of documented cases. This might include the summary, timeline or any other bits of information, as laid out in our handbook page. Any of this confidential data will be in a linked issue, only visible internally. By default, all information we can share will be public, in accordance with our transparency value.
Security Note: If anything abnormal is found during the course of your investigation, please do not hesitate to contact security.
We are having ImagePull failures across GPRD zonal and regional clusters as part of a release:
The pod events for the failing image are as follows:
| Type | Reason | Age | From | Message |
| --- | --- | --- | --- | --- |
| Normal | Scheduled | 25m | default-scheduler | Successfully assigned gitlab/gitlab-webservice-git-7c55b5f4b7-d6stm to gke-gprd-us-east1-b-generic-2-92c921ca-snfs |
| Normal | Pulling | 25m | kubelet | Pulling image "busybox:latest" |
| Normal | Pulled | 25m | kubelet | Successfully pulled image "busybox:latest" in 661ms (661ms including waiting). Image size: 2166802 bytes. |
| Normal | Created | 25m | kubelet | Created container write-instance-name |
| Normal | Started | 25m | kubelet | Started container write-instance-name |
| Normal | Pulled | 25m | kubelet | Container image "us-east1-docker.pkg.dev/gitlab-com-artifact-registry/images/certificates:17-5-202410151605-e651cfe6c38" already present on machine |
| Normal | Started | 25m | kubelet | Started container certificates |
| Normal | Created | 25m | kubelet | Created container certificates |
| Warning | Failed | 25m (x3 over 25m) | kubelet | Failed to pull image "us-east1-docker.pkg.dev/gitlab-com-artifact-registry/images/gitlab-base:17-5-202410151605-e651cfe6c38": rpc error: code = NotFound desc = failed to pull and unpack image "us-east1-docker.pkg.dev/gitlab-com-artifact-registry/images/gitlab-base:17-5-202410151605-e651cfe6c38": failed to resolve reference "us-east1-docker.pkg.dev/gitlab-com-artifact-registry/images/gitlab-base:17-5-202410151605-e651cfe6c38": us-east1-docker.pkg.dev/gitlab-com-artifact-registry/images/gitlab-base:17-5-202410151605-e651cfe6c38: not found |
| Warning | Failed | 25m (x3 over 25m) | kubelet | Error: ErrImagePull |
| Warning | Failed | 24m (x4 over 25m) | kubelet | Error: ImagePullBackOff |
| Normal | Pulling | 24m (x4 over 25m) | kubelet | Pulling image "us-east1-docker.pkg.dev/gitlab-com-artifact-registry/images/gitlab-base:17-5-202410151605-e651cfe6c38" |
| Normal | BackOff | 45s (x107 over 25m) | kubelet | Back-off pulling image "us-east1-docker.pkg.dev/gitlab-com-artifact-registry/images/gitlab-base:17-5-202410151605-e651cfe6c38" |
This tag did work successfully on staging (zonal cluster)
The rollout will fail shortly, but we cannot roll forward. Delivery has looked everywhere and we're not finding any good answers yet.
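To gauge how widespread the pull failures are on a given cluster, something like the following can enumerate the stuck pods. This is a minimal sketch using the Kubernetes Python client; the `gitlab` namespace and the kubeconfig context are assumptions, not the exact commands used during the incident.

```python
# Minimal sketch: list pods in the gitlab namespace whose containers are stuck
# waiting on image pulls, to gauge how widespread the failure is.
# Assumes a kubeconfig context pointing at the affected cluster; the
# namespace name is an assumption.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() when run in-cluster
v1 = client.CoreV1Api()

for pod in v1.list_namespaced_pod("gitlab").items:
    for status in (pod.status.container_statuses or []):
        waiting = status.state.waiting
        if waiting and waiting.reason in ("ImagePullBackOff", "ErrImagePull"):
            print(pod.metadata.name, status.image, waiting.reason)
```

Running this once per zonal/regional context would also show whether anything other than gitlab-base is affected.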
@ahmadsherif this is probably going to roll over into your shift too, but currently there is nothing we need to do as EOC; the release managers are deploying the new auto-deploy package, which seems to have been tagged properly.
I am raising the severity of this to :s2: since it ended up causing #18717 (closed), and the blast radius could have been larger: because of it, Mailroom and most of our Sidekiq workers started to suffer.
Thanks for taking part in this incident! It looks like this incident needs an async Incident Review issue. Please use the Incident Review link in the incident's description to create one.
We're posting this message because this issue meets the following criteria:
If you are certain that this incident doesn't require an incident review, add the
IncidentReviewNotNeeded label to this issue with a note explaining why.
We started to get paged from different components that pods were not in a running state.
We identified the problem as a missing image, and verified that the image didn't have the tag in the Artifact Registry.
We decided to go ahead with the deploy, as we thought it would eventually fail and Helm would roll back.
But we didn't notice that the image had actually been pushed and then somehow got deleted. As we can see, the same image had been running, but new pods failed to pull it after we scaled up: #18715 (comment 2160499413)
This ended up building up the Sidekiq queues and causing #18717 (closed)
We decided to manually tag the correct image with the expected tag as the fastest way to mitigate the issue (an illustrative sketch follows this timeline); ref: #18717 (comment 2160431983)
And since the new auto-deploy tagged the image correctly, we decided to roll forward with the new image.
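For reference, the manual re-tagging mentioned above boils down to pointing the expected tag back at the still-existing manifest digest. Below is a hedged sketch of doing that through the Docker Registry v2 API; the digest, repository, and token handling are placeholders, and the actual mitigation may have used different tooling (e.g. gcloud or crane).

```python
# Minimal sketch: point a tag at an existing manifest digest by re-pushing the
# same manifest bytes under the tag (Docker Registry v2 API). The digest and
# token are placeholders; this is illustrative, not the exact mitigation steps.
import requests

REGISTRY = "us-east1-docker.pkg.dev"
REPOSITORY = "gitlab-com-artifact-registry/images/gitlab-base"
DIGEST = "sha256:<manifest digest to re-tag>"   # placeholder
TAG = "17-5-202410151605-e651cfe6c38"
TOKEN = "<access token>"                        # e.g. a Google OAuth access token

AUTH = {"Authorization": f"Bearer {TOKEN}"}
ACCEPT = ", ".join([
    "application/vnd.docker.distribution.manifest.v2+json",
    "application/vnd.docker.distribution.manifest.list.v2+json",
    "application/vnd.oci.image.manifest.v1+json",
    "application/vnd.oci.image.index.v1+json",
])

# Fetch the manifest by digest (it still exists; only the tag is gone).
get = requests.get(
    f"https://{REGISTRY}/v2/{REPOSITORY}/manifests/{DIGEST}",
    headers={**AUTH, "Accept": ACCEPT},
)
get.raise_for_status()

# Push the identical bytes back under the tag; the registry simply adds the tag.
put = requests.put(
    f"https://{REGISTRY}/v2/{REPOSITORY}/manifests/{TAG}",
    headers={**AUTH, "Content-Type": get.headers["Content-Type"]},
    data=get.content,
)
put.raise_for_status()
print("tag now points at", put.headers.get("Docker-Content-Digest"))
```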
Above, gitlab-base:17-5-202410151605-e651cfe6c38 is mentioned, and elsewhere, seemingly, the pipeline that would have produced it. That pipeline certainly pushed this image, according to the final artifacts listing.
The tag is present on .org, though I will admit that the screenshot above does leave some confusion. The job which produced the tag shows `Image exists already and will not be built again.`, thus it determined there had been no functional changes to the image, or to the base image over which it was built (debian-stable records this). The existing image was then pulled, re-certified, and pushed back under the same tag.
This is the image manifest that was made and pushed. That is the manifest being reported as "missing", though under the tag. I see comments above regarding gitlab-base@sha256:3edb42f83cf462f679a32f2dee5d0a7c38c63efd012c3327d40a34ce3837c3dc, but I do not understand the cause of the misalignment.
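To make that split concrete: the manifest can still be fetched by digest while the tag returns NotFound. A minimal sketch that checks both references side by side; the token handling is illustrative, and the digest is the one quoted in the comments above.

```python
# Minimal sketch: a manifest can still exist under its digest while the tag no
# longer resolves. Checking both references side by side shows the split.
# Token handling is illustrative.
import requests

REGISTRY = "us-east1-docker.pkg.dev"
REPOSITORY = "gitlab-com-artifact-registry/images/gitlab-base"
TOKEN = "<access token>"
REFERENCES = [
    "17-5-202410151605-e651cfe6c38",  # the tag kubelet asked for
    "sha256:3edb42f83cf462f679a32f2dee5d0a7c38c63efd012c3327d40a34ce3837c3dc",  # digest from the comments
]

ACCEPT = ", ".join([
    "application/vnd.docker.distribution.manifest.v2+json",
    "application/vnd.docker.distribution.manifest.list.v2+json",
    "application/vnd.oci.image.manifest.v1+json",
    "application/vnd.oci.image.index.v1+json",
])

for ref in REFERENCES:
    resp = requests.head(
        f"https://{REGISTRY}/v2/{REPOSITORY}/manifests/{ref}",
        headers={"Authorization": f"Bearer {TOKEN}", "Accept": ACCEPT},
    )
    print(f"{ref}: HTTP {resp.status_code}")
```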
@WarheadsSE from what I can tell from the logs, the image was present in the Artifact Registry for a while, but the tag disappeared / was deleted without any audit trace.
I've just been having a look at this with @tkhandelwal3.
One odd thing we noticed when looking through the asset inventory for the affected image is that the manifest seems to have been deleted at approximately 2024-10-15T20:55:49Z and recreated at 2024-10-15T22:50:40.563551Z without any of the previously defined tags. The recreation could be cosign acting normally, recreating the manifest as it presumably didn't exist at that time.
We can't see any request to that resource in the logs for the project during that time frame, so it's unclear exactly what is happening.
I would hazard a guess that this is either a bug in artifact registry, or some kind of clean up behaviour. Either way I think a support ticket to GCP is probably a good course of action.
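One way to double-check the "recreated without any of the previously defined tags" observation from outside the asset inventory is a reverse lookup against the registry itself: list the repository's tags and see which, if any, currently resolve to the recreated manifest's digest. A minimal sketch; pagination of `/tags/list` and token handling are simplified, and the digest is a placeholder.

```python
# Minimal sketch: enumerate the repository's tags and report which of them (if
# any) currently resolve to a given manifest digest. Pagination of /tags/list
# is ignored for brevity; the digest and token are placeholders.
import requests

REGISTRY = "us-east1-docker.pkg.dev"
REPOSITORY = "gitlab-com-artifact-registry/images/gitlab-base"
DIGEST = "sha256:<digest of the recreated manifest>"  # placeholder
TOKEN = "<access token>"

AUTH = {"Authorization": f"Bearer {TOKEN}"}
ACCEPT = ", ".join([
    "application/vnd.docker.distribution.manifest.v2+json",
    "application/vnd.docker.distribution.manifest.list.v2+json",
    "application/vnd.oci.image.manifest.v1+json",
    "application/vnd.oci.image.index.v1+json",
])

tags = requests.get(
    f"https://{REGISTRY}/v2/{REPOSITORY}/tags/list", headers=AUTH
).json().get("tags") or []

matching = []
for tag in tags:
    resp = requests.head(
        f"https://{REGISTRY}/v2/{REPOSITORY}/manifests/{tag}",
        headers={**AUTH, "Accept": ACCEPT},
    )
    if resp.headers.get("Docker-Content-Digest") == DIGEST:
        matching.append(tag)

print("tags pointing at the manifest:", matching or "none")
```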
This error stands out: `invalid manifest list/index reference(s), please report this issue to GitLab at https://gitlab.com/gitlab-org/container-registry/-/issues/409`
@rehab the times are all UTC. It's quite confusing in the UI, as some timestamps are shown in local time, but it appears the change times are all in UTC. I'll continue using UTC for this comment.
The actual image doesn't look like it has changed in quite a while, which is why we're seeing tags added to the existing image manifest. For example, the reference you mention was added at 2024-10-15T16:49:28.382575Z and was only removed at 2024-10-15T20:55:49Z.
I do wonder if there is a process that deleted the image manifest after the invalid manifest failure you mention.
Having a look further back in the history for that image, it was created for 17-3-stable and only had tags added to it until 2024-10-07T17:13:45.079855Z, when some tags were removed.
Given the timing of this job and its name, it looks very suspect, but unfortunately the job logs are truncated, so I can't see exactly what happened in this case.
I see this pipeline for the auto-deploy - 17-5-202410151605 - which started at nearly the same time as this pipeline for buildx-test3-0. The sync job did not run for buildx-test3-0 as the pipeline failed, but it did complete the build gitlab-base job. Did that upload the gitlab-base image (to dev.gitlab.org only) with no tag, or only one tag? Would that have somehow tripped up the cosign / sync job from the auto-deploy?
Though, having a look at the cosign codebase for the copy operation, it looks like if a tag can't be found then it should just be skipped. I could only see this happening if we could successfully push and pull an empty manifest to the dev GitLab container registry.
The thing that I'm really interested in knowing, but can't find a way to see, is the actor that caused the deletion. It's not captured in the Google audit logs for some reason.
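For completeness, this is roughly the query shape that could surface the deleting actor from Cloud Audit Logs, if the relevant Data Access logs are enabled for Artifact Registry (which may be exactly the gap here). A sketch only; the project id, time window, and filter are assumptions.

```python
# Minimal sketch: search Cloud Audit Logs for operations touching gitlab-base
# around the deletion window, hoping to catch the deleting actor. Assumes Data
# Access audit logs are enabled for Artifact Registry in this project; the
# project id, filter, and payload shape are assumptions.
from google.cloud import logging

client = logging.Client(project="gitlab-com-artifact-registry")
log_filter = (
    'logName:"cloudaudit.googleapis.com" '
    'AND protoPayload.resourceName:"gitlab-base" '
    'AND timestamp>="2024-10-15T20:00:00Z" AND timestamp<="2024-10-15T23:00:00Z"'
)

for entry in client.list_entries(filter_=log_filter, order_by=logging.DESCENDING):
    payload = entry.payload if isinstance(entry.payload, dict) else {}
    print(
        entry.timestamp,
        payload.get("methodName"),
        payload.get("authenticationInfo", {}).get("principalEmail"),
        payload.get("resourceName"),
    )
```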
@dawsmith Since `push_tags` was called: yes, an image tag was pushed to Dev after being run through cosign.
There should not have been a "removal" of any tag. The linked auto-deploy pipeline started, and finished, long before the second pipeline's job started.
Note that both of the linked jobs refer to identical container version expectations.
`Image exists already and will not be built again.`
`Image exists - dev.gitlab.org:5005/gitlab/charts/components/images/gitlab-base:831cbae24236df6791f209c7e5c4f6b5e231183d@sha256:d4a2d2b8a0f594ec856143aa91ead6be49f3254e3d891088fc400123c0e0abd7`
As to 17-3-stable, here is the latest pipeline from that branch, and the gitlab-base job. This job ran ~2.5 hours later than the previous two. That pipeline determined gitlab-base's CONTAINER_VERSION was to be 1fbd1925e053719a2a0c163add7ac0dac166990f. That does not align with 831cbae24236df6791f209c7e5c4f6b5e231183d.
This incident was automatically closed because it has the IncidentResolved label.
Note: All incidents are closed automatically when they are resolved, even when there is a pending review. Please see the Incident Workflow section on the Incident Management handbook page for more information.
This issue has IncidentReviewNeeded set.
Please either fix this by adding one of the corresponding labels, or add a ::NotNeeded scoped label with an explanation if you are sure that it is not needed. Adding the ::NotNeeded scoped label will prevent these notifications; otherwise this notice will repeat in 7 days.