Establish a process for broken GitLab releases
Context
GitLab patch releases are scheduled twice a month to meet the Releases SLO for bug and vulnerability fixes. The regular cadence provides predictability inside GitLab and therefore stability for our customers. However, the planned patch release process doesn't account for extreme situations, particularly when GitLab releases are broken and therefore unusable for GitLab customers.
During the last patch release for 16.10, 16.9 and 16.8, GitLab released broken packages that prevented customers from upgrading to the newest version: gitlab-org/omnibus-gitlab#8488 (closed)
Broken releases impact GitLab from several angles:
- By releasing corrupted packages to the public, GitLab's credibility is damaged.
- Complexity is added to the upgrade process, which dissatisfies GitLab customers.
- GitLab Dedicated upgrade processes are halted until a fix is in place: https://gitlab.com/gitlab-com/gl-infra/gitlab-dedicated/team/-/issues/4594#note_1860295622
- Reactive release work is forced on release managers, who had to act quickly to guarantee the bug didn't impact the monthly release.
Because of the above, release managers expedited an urgent patch release. This is not the first time a broken build has been released: in December 2023, broken builds were released to the public, forcing another expedited patch release process: https://gitlab.com/gitlab-org/release/tasks/-/issues/8117.
Proposal: Establish a release process for broken releases
Engineering processes should be in place to catch these errors before releases are published. However, we also need a critical process that lets us pull an emergency lever when a broken release is published anyway. A definition and a runbook should be available to release managers, spelling out the steps to follow when broken builds are detected.
Broken releases definition
A "Broken Release" is a release in a complete failure state that prevents customers from installing or starting GitLab correctly, or from upgrading from a previous version of GitLab.
Runbook
Based on experience, the steps to follow are:
- Declare a production incident
- Along with the EOC, determine if a quick workaround is available. A domain expert will be required.
- If a workaround is available, document it in the incident issue and in any other issue reported by customers.
- Ensure the culprit issue is not included in the monthly release
- Determine whether the packages can be taken down from the package cloud. This depends on the content of the packages: if they contain high-severity security fixes, taking them down should not be an option.
- Work towards an expedited patch release process to be published as soon as possible.
- Request a retrospective for the incident. It is important to understand how this happened and which actions to take to prevent it from happening again.
It is important to highlight that the expedited patch release process is limited to unusable/broken releases. Regular bug and security fixes must adhere to the GitLab SLOs.
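As an illustration only, the definition and runbook above can be sketched as a small decision helper. This is a minimal sketch under stated assumptions, not existing release tooling: `ReleaseReport`, its fields, `is_broken_release`, and `runbook_steps` are hypothetical names introduced here, and "high-severity security fixes" is one reading of the package-content condition in the runbook.

```python
from dataclasses import dataclass


@dataclass
class ReleaseReport:
    """Hypothetical summary of what is known about a published release."""
    version: str
    installs: bool   # package installs correctly
    starts: bool     # GitLab starts correctly after installation
    upgrades: bool   # upgrade from a previous version succeeds
    has_high_severity_security_fixes: bool  # assumption: drives the takedown decision
    workaround_available: bool


def is_broken_release(report: ReleaseReport) -> bool:
    """A release is broken if it cannot be installed, started, or upgraded to."""
    return not (report.installs and report.starts and report.upgrades)


def runbook_steps(report: ReleaseReport) -> list[str]:
    """Return the runbook steps in order.

    An empty list means the release is not broken, so fixes follow the
    regular patch release schedule and SLOs instead of the expedited process.
    """
    if not is_broken_release(report):
        return []

    steps = [
        "Declare a production incident",
        "With the EOC and a domain expert, determine if a quick workaround is available",
    ]
    if report.workaround_available:
        steps.append("Document the workaround in the incident issue and customer-reported issues")
    steps.append("Ensure the culprit issue is not included in the monthly release")
    if report.has_high_severity_security_fixes:
        steps.append("Keep the packages published because they contain security fixes")
    else:
        steps.append("Evaluate taking the packages down from the package cloud")
    steps.append("Work towards an expedited patch release")
    steps.append("Request a retrospective for the incident")
    return steps


if __name__ == "__main__":
    # "X.Y.Z" is a placeholder, not a real version.
    report = ReleaseReport("X.Y.Z", installs=True, starts=True, upgrades=False,
                           has_high_severity_security_fixes=True,
                           workaround_available=False)
    for number, step in enumerate(runbook_steps(report), start=1):
        print(f"{number}. {step}")
```

Returning an empty list for a release that is not broken mirrors the rule that the expedited process is reserved for unusable releases only.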