The time taken to resolve broken tests on branches is painful during a release management cycle
Problem Statement
Release Managers must stay on top of test failures across multiple branches, multiple repos, and multiple GitLab instances, for a wide array of reasons. This will only get worse with the upcoming feature work needed to support #2618 (closed) and https://gitlab.com/gitlab-com/gl-infra/delivery/-/issues/2645
15.5.x security release Functional Example
During the most recent security release, we suffered a failure on just one branch, the FOSS 15-5-stable
branch. Here's the timeline for fixing it:
- 11:51 UTC - all security picks are merged into the 15-5-stable-ee branch
- 14:41 UTC - handover to the next AMER RM
- 15:16 UTC - RM retries suspected flaky tests - https://dev.gitlab.org/gitlab/gitlabhq/-/pipelines/258125
- 16:35 UTC - RM engages development as the tests continue to fail
- 16:37 UTC - a known flaky test and its fix are cherry-picked into an MR targeting 15-5-stable-ee - gitlab-org/gitlab!102667 (merged)
- 18:06 UTC - that MR is failing tests because of a new flaky test - retries did not help
- 18:18 UTC - an MR to quarantine the second test is created against canonical master - gitlab-org/gitlab!102676 (merged)
- 19:13 UTC - the second MR is merged, then cherry-picked into the first MR
- 21:15 UTC - the first MR is merged - now we wait for 15-5-stable-ee and 15-5-stable to be green again
- 22:09 UTC - handover to the APAC RM
- 07:07 UTC next day - handover to the EMEA RM
- 07:25 UTC next day - RM discovers the 15-5-stable branch is still red with yet another, different failure - retried - https://dev.gitlab.org/gitlab/gitlabhq/-/pipelines/258187
- 08:13 UTC next day - branches are reported to be green
One may question how long it takes us to notice these failures and act on them. We currently have no active alerts that fire when a failure is detected, so unless individuals have modified their workflow to watch pipelines, it is difficult to identify these failures in a timely manner.
Potential Solution(s)
- During an active release cycle, consider how to notify the release manager when the active branches we must monitor change state (a minimal polling sketch follows this list)
- Consider creating a proactive capability to test stable branches ahead of time. As the timeline above shows, flaky tests surface even after a release has, at some point, been considered finalized. Running tests ahead of key checkpoints in our procedure would let us get ahead of failures and reduce stress as the release deadline nears
- Flaky tests are almost always related to UI testing, specifically waiting for target DOM elements. Work with QA to evaluate the usefulness of these tests and/or alternatives, with the intent of reducing the number of flaky tests that appear
- ...
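
A minimal sketch of what the notification idea could look like, assuming we poll the GitLab API for the latest pipeline on each active stable branch and post to a chat channel when one turns red. The instance URL, project path, branch list, environment variable names, and Slack webhook are placeholder assumptions for illustration, not a proposed design:

```python
# Sketch: poll the latest pipeline per stable branch and alert when one is red.
# Instance, project, branches, and the webhook are assumed placeholders.
import os
import time

import requests

GITLAB_API = "https://dev.gitlab.org/api/v4"     # assumed instance
PROJECT = "gitlab%2Fgitlabhq"                     # URL-encoded project path (example)
BRANCHES = ["15-5-stable", "15-5-stable-ee"]      # active release branches (example)
TOKEN = os.environ["GITLAB_TOKEN"]                # read-only API token (assumed env var)
SLACK_WEBHOOK = os.environ["SLACK_WEBHOOK_URL"]   # hypothetical incoming webhook


def latest_pipeline(branch):
    """Return the most recent pipeline for a branch, or None if there is none."""
    resp = requests.get(
        f"{GITLAB_API}/projects/{PROJECT}/pipelines",
        params={"ref": branch, "per_page": 1},
        headers={"PRIVATE-TOKEN": TOKEN},
        timeout=30,
    )
    resp.raise_for_status()
    pipelines = resp.json()
    return pipelines[0] if pipelines else None


def notify(branch, pipeline):
    """Post a short alert so the on-duty RM does not have to watch pipelines."""
    text = f":red_circle: {branch} is failing: {pipeline['web_url']}"
    requests.post(SLACK_WEBHOOK, json={"text": text}, timeout=30)


if __name__ == "__main__":
    alerted = {}  # branch -> last pipeline id we alerted on, to avoid repeats
    while True:
        for branch in BRANCHES:
            pipeline = latest_pipeline(branch)
            if pipeline and pipeline["status"] == "failed" and alerted.get(branch) != pipeline["id"]:
                notify(branch, pipeline)
                alerted[branch] = pipeline["id"]
        time.sleep(300)  # poll every five minutes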
Milestones
- Additional evidence and/or ideas are gathered
- Issues created to address them
- Retro this issue 1 year from now