Introduce process for handling merge failures on Dev
Context
In preparation for the monthly release, the Delivery team depends on a green master on the GitLab project on the dev instance. A green master status on this project gives release managers confidence that release candidate packages can be built and that they are ready to be released on the 22nd.
Failures on the GitLab dev project lack visibility. At the moment, when a failure occurs, a message is posted on the #master-broken-mirrors Slack channel with no guidelines on how to address it. This channel also has very few participants, so it is unusual for someone to take action on a failure.
Problem
During the final steps of the preparation for the 16.1 release, release managers discovered that the default branch of the GitLab dev project had been broken for at least a week (production#15901 (closed)). Noticing the failure during the release run-up is not ideal: release managers have to rush to understand the failure, find the root cause, mitigate it, and continue with the next steps of the release preparation, all within a limited time window (less than 24 hours).
This is not the first time this has happened: a similar scenario occurred during the 15.11 release preparation (#4606 (closed)).
Timeline
All times in UTC
- 2023-06-17 11:58
  - gitlab-org/gitlab!122308 (merged) was merged. During production#15901 (closed), this MR was believed to be the possible culprit. It was later found that the failures were associated with a runner configuration change.
- 2023-06-17 13:00
  - Failure is present on dev: https://dev.gitlab.org/gitlab/gitlab-ee/-/commit/e0a3414fab23fe384a0df24661cc268ef2f820e0. The merge request author is not notified.
- 2023-06-20 21:48
  - Failures on the GitLab dev project are discovered.
- 2023-06-20 22:00
  - production#15901 (closed) is created.
- 2023-06-20 22:59
  - Two different failures are identified: one associated with Zoekt network issues (with gitlab-org/gitlab!122308 (merged) as a possible culprit), and another for database migration timeouts on specs.
- 2023-06-21 02:55
  - The root cause is discovered: production#15901 (comment 1439119289).
- 2023-06-21 07:24
  - Incident is mitigated and monthly release preparation is unblocked.
Questions
The purpose of this issue is to address the following questions:
- When a merge request legitimately causes a failure on dev, how can that failure be surfaced earlier to developers?
- Are merge request authors the right people to address the failures? If not, which team is the DRI for GitLab dev failures?
- Which team owns the GitLab dev project?
- Are the specs of the GitLab dev project useful?