Scheduled GitLab Runner installation test workflow

Summary

The incident was triggered by a subset of tags missing for the helper images. This caused CI/CD pipeline failures across GitLab repositories and affected anyone trying to pull registry.gitlab.com/gitlab-org/gitlab-runner/gitlab-runner-helper:x86_64-v17.8.0, the image GitLab Runner 17.8.0 uses to run every job.

We currently have no way to detect when a bad release breaks GitLab Runner installation. This corrective action adds detection by scheduling an installation test against the current production sources. A failure of the test workflow will alert the relevant on-call.

Related Incident(s)

Desired Outcome/Acceptance Criteria

  • Every 8 hours, a scheduled workflow runs a fresh GitLab Runner installation for each distribution method (deb, rpm, and image) using the current production sources.
  • A failure of the scheduled installation test alerts the relevant on-call.
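
A minimal sketch of what the scheduled check could look like, assuming the scheduler invokes a script every 8 hours and raises the alert when that script exits non-zero. The per-method commands, the `gitlab/gitlab-runner:latest` image name, and every function name here are illustrative assumptions, not the actual workflow. Only the helper image reference format is taken from the summary above.

```python
import subprocess
from typing import Callable, Dict, List

# Assumed install commands, one per distribution method. The real workflow
# would run each in a fresh environment with the production package sources
# already configured; these are placeholders.
INSTALL_CHECKS: Dict[str, List[str]] = {
    "deb": ["apt-get", "install", "-y", "gitlab-runner"],
    "rpm": ["dnf", "install", "-y", "gitlab-runner"],
    "image": ["docker", "pull", "gitlab/gitlab-runner:latest"],
}


def helper_image_ref(version: str, arch: str = "x86_64") -> str:
    """Build a helper image reference in the form quoted in the summary."""
    return (
        "registry.gitlab.com/gitlab-org/gitlab-runner/"
        f"gitlab-runner-helper:{arch}-v{version}"
    )


def run_command(cmd: List[str]) -> bool:
    """Return True when the command exists and exits with status 0."""
    try:
        return subprocess.run(cmd, check=False).returncode == 0
    except OSError:
        # The binary is missing or not executable: treat as a failed check.
        return False


def run_install_checks(
    checks: Dict[str, List[str]],
    runner: Callable[[List[str]], bool] = run_command,
) -> List[str]:
    """Run every check and return the names of the ones that failed."""
    return [name for name, cmd in checks.items() if not runner(cmd)]


def main(version: str) -> None:
    """Hypothetical entry point: exit non-zero so the scheduler can alert."""
    checks = dict(INSTALL_CHECKS)
    # Also confirm the helper image tag for this version can be resolved.
    checks["helper-image"] = [
        "docker", "manifest", "inspect", helper_image_ref(version),
    ]
    failed = run_install_checks(checks)
    if failed:
        raise SystemExit("installation test failed for: " + ", ".join(failed))
```

Calling `main("17.8.0")` would cover the three distribution methods plus the helper image tag that was missing in the incident. The `runner` parameter allows the check logic to be exercised without touching real package managers.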

Associated Services

Service: CI Runners in GitLab.com / GitLab Infrastructure Team / Production Engineering

Corrective Action Issue Checklist

  • Link the incident(s) this corrective action arose from
  • Give context for the problem this corrective action is meant to prevent from recurring
  • Assign a severity label (this is the highest sev of related incidents, defaults to 'severity::4')
  • Assign a priority (this will default to 'Production Engineering::P4' but should match the severity of the related incident)
  • Assign a service label
  • Assign a team label
Edited by Joe Burnett