Retrospective: Multiple failures when preparing the release candidate for 15.11
Context
In the run-up to the release on the 22nd (usually two days before), a release candidate is identified and tagged to be later used as the base for the final release version. During the 15.11 preparation, multiple failures were encountered on the GitLab project dev instance (dev.gitlab.org) when tagging the release candidate. These failures were present on neither Canonical nor Security, and they had existed on dev for multiple days before the release candidate was tagged.
Encountering problems when tagging the release candidate is troublesome due to time constraints: release managers have limited time to identify the problem, find a solution, and continue with the preparation for the 22nd.
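For context, tagging a release candidate essentially means creating an annotated tag on the stable branch and pushing it to dev, which kicks off the release pipelines. A minimal sketch of the manual equivalent is below; in practice this is automated by release-tools, and the remote name, branch, and tag shown here are illustrative assumptions, not the exact commands used.

```shell
# Illustrative sketch only: the real process is automated by release-tools.
# The remote name "dev", the branch, and the tag below are assumptions.
git fetch dev 15-11-stable-ee
git tag -a -m "Version v15.11.0-rc42" v15.11.0-rc42 FETCH_HEAD
git push dev v15.11.0-rc42
```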
Timeline
2023-04-20

- 11:46 UTC - RC42 is tagged.
- 12:46 UTC - Four different failures on the dev stable branches are discovered.
- 12:48 UTC - It is noticed that the failures are not present on Canonical or Security.
- 15:08 UTC - A different problem is discovered with the RC: a breaking change was accidentally included. Another RC is required at this point.
- 19:01 UTC - RC43 is tagged.
- 19:22 UTC - The root cause of most of the dev failures is found: gitlab-org/gitlab!117262 (merged). A fix is prepared (gitlab-org/gitlab!118261 (merged)) and backported to the 15.11 stable branch (gitlab-org/gitlab!118263 (merged)).
- 19:40 UTC - RC43 contains the same failures as RC42, so a new one is required.
- 21:38 UTC - RC44 is tagged.
- 23:47 UTC - Most of the failures are fixed by gitlab-org/gitlab!118263 (merged). Only one remains.

2023-04-21

- 00:16 UTC - Based on the successful status of the last pre deployment, release managers decide to continue with the release preparation despite the failure. The remaining failure is reported in gitlab-org/gitlab#408305 (closed).
Discussions
- How can dev failures be surfaced earlier?
  - The 15.11 release was at risk because of a failure that had been present on dev for at least two weeks.
- What should be the process for dev failures?
  - At the moment, dev pipeline errors trigger a message on the #broken-master-mirrors Slack channel, but the channel has a limited set of participants and there is no clear DRI or process for handling the failures.
- Should the release process rely on dev projects to determine the release candidate status?
  - With dev failures not being addressed in time, we might want to consider using the Security or Canonical repositories instead. One advantage of dev is that it is a CE instance, so failures there could represent legitimate errors for CE self-managed installations.