Incident Review: GitLab Runner Helper v17.8 Image Not Found

Key Information

Metric | Value
Customers Affected | Any CI runner trying to pull the latest released gitlab-runner image version, v17.8.0, will fail to find the image (Self-Managed and GitLab.com)
Requests Affected | 53,246
Incident Severity | severity2
Start Time | 2025-01-16 14:30 UTC
End Time | 2025-01-16 20:19 UTC
Total Duration | 5 hours 49 minutes
Link to Incident Issue | 2025-01-16: GitLab-runner image v17.8.0 not found (#19129 - closed) • Sarah Walker

Summary

The incident was raised because a subset of tags was missing for the helper images. This caused CI/CD pipeline failures across GitLab repositories and affected anyone trying to pull registry.gitlab.com/gitlab-org/gitlab-runner/gitlab-runner-helper:x86_64-v17.8.0, an image that GitLab Runner 17.8.0 uses to run every job.
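
To make the affected reference concrete, here is a minimal Python sketch of how such a helper image reference is composed. The `<arch>-v<version>` tag layout is inferred from the single reference quoted in this summary, and `helper_image_ref` is a hypothetical name used only for illustration; other helper image flavors may use different tag shapes.

```python
# Registry path taken from the image reference quoted in the summary.
HELPER_REPOSITORY = "registry.gitlab.com/gitlab-org/gitlab-runner/gitlab-runner-helper"


def helper_image_ref(arch: str, version: str) -> str:
    """Build a helper image reference, assuming tags look like '<arch>-v<version>'."""
    return f"{HELPER_REPOSITORY}:{arch}-v{version}"


if __name__ == "__main__":
    # The reference that could not be pulled during the incident.
    print(helper_image_ref("x86_64", "17.8.0"))
```

A runner on 17.8.0 requesting this reference would have failed the pull while the tag was missing.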

Details

Due to a pipeline refactor, several GitLab-Runner helper images were not deployed as they typically would be:

As we were made aware of these issues, we rectified some manually (ensuring the tags for the uploaded images were complete); others required patching and a pipeline release.

  • v17.8.1 was released 2025-01-17
  • v17.8.2 was released 2025-01-22
  • v17.8.3 was released 2025-01-23

Outcomes/Corrective Actions

  1. We have no way of testing a release. This gap became apparent during the pipeline refactor that caused the incident, and an issue had already been created to resolve it:
  2. Windows helper images were hard to use as part of our tests, typically involving a registry. This has now been corrected: gitlab-org/gitlab-runner!5187 (merged)
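
One possible shape for a post-release check, sketched in Python under stated assumptions: it compares the tags a release is expected to publish against the tags actually present, and reports the gaps. The architecture list is hypothetical (only x86_64 appears in this review), the function names are illustrative, and fetching the published tag list from the registry is deliberately left out. This is not the runner team's actual tooling.

```python
from typing import Iterable, List

# Hypothetical architecture list; only x86_64 is named in this review.
EXPECTED_ARCHES = ["x86_64", "arm64"]


def expected_helper_tags(version: str, arches: Iterable[str] = EXPECTED_ARCHES) -> List[str]:
    """List the tags a release should publish, assuming '<arch>-v<version>' naming."""
    return [f"{arch}-v{version}" for arch in arches]


def missing_tags(expected: Iterable[str], published: Iterable[str]) -> List[str]:
    """Return the expected tags that are absent from the published set, sorted."""
    published_set = set(published)
    return sorted(tag for tag in expected if tag not in published_set)


if __name__ == "__main__":
    # Stand-in for a tag listing obtained from the registry by some other means.
    published = ["arm64-v17.8.0"]
    gaps = missing_tags(expected_helper_tags("17.8.0"), published)
    print(gaps)  # ['x86_64-v17.8.0']
```

Run as a separate job after publishing, a check like this would fail the pipeline when `gaps` is non-empty, rather than leaving the omission to be discovered by failing CI jobs.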

Learning Opportunities

What went well?

  1. Due to the pipeline refactor, some actions were now very easy to rectify manually. The next step, as per the corrective actions above, is to extract these steps and allow them to run as a separate job.
  2. Due to recent pipeline refactors, a release pipeline now only takes 1hr50m to complete, rather than 3-4hrs.

What was difficult?

  1. Having to make entirely new patch releases of gitlab-runner to fix issues in packages and images was unpleasant. There were no changes to the runner itself in any of the patch releases, only to images and packages.
  2. ☝️ That, and the general maintenance burden of creating packages and images for gitlab-runner is, I (@avonbertoldi) think, a huge distraction for the runner team, who should be working on core runner features, not packaging.

Review Guidelines

This review should be completed by the team which owns the service causing the alert. That team has the most context around what caused the problem and what information will be needed for an effective fix. The EOC or IMOC may create this issue, but unless they are also on the service owning team, they should assign someone from that team as the DRI.

For the person opening the Incident Review

  • Set the title to Incident Review: (Incident issue name)
  • Assign a Service::* label (most likely matching the one on the incident issue)
  • Set a Severity::* label which matches the incident
  • In the Key Information section, make sure to include a link to the incident issue
  • Find and assign a DRI from the team which owns the service (check their Slack channel or assign the team's manager). The DRI for the incident review is the issue assignee.
  • Announce the incident review in the incident channel on Slack.
:mega: @here An incident review issue was created for this incident with <USER> assigned as the DRI.
If you have any review feedback please add it to <ISSUE_LINK>.

For the assigned DRI

  • Fill in the remaining fields in the Key Information section, using the incident issue as a reference. Feel free to ask the EOC or other folks involved if anything is difficult to find.
  • If there are metrics showing Customers Affected or Requests Affected, link those metrics in those fields
  • Create a few short sentences in the Summary section summarizing what happened (TL;DR)
  • Use the description section to write a few paragraphs explaining what happened
  • Link any corrective actions and describe any other actions or outcomes from the incident
  • Consider the implications for self-managed and Dedicated instances. For example, do any bug fixes need to be backported?
  • Add any appropriate labels based on the incident issue and discussions
  • Once discussion wraps up in the comments, summarize any takeaways in the details section
  • Close the review before the due date
Edited by Joe Burnett