The time taken to resolve broken tests on branches is painful during a release management cycle
Problem Statement
Release Managers must stay on top of test failures across multiple branches, multiple repos, and multiple GitLab instances, for a wide array of reasons. This will only get worse with the upcoming feature work needed to support #2618 (closed) and https://gitlab.com/gitlab-com/gl-infra/delivery/-/issues/2645
15.5.x security release Functional Example
During the most recent security release, we suffered a failure on just one branch, the FOSS 15-5-stable
branch. Here's the timeline for fixing it:
- 11:51 UTC - all security picks are merged into the 15-5-stable-ee branch
- 14:41 UTC - handover to the next AMER RM
- 15:16 UTC - RM retries suspected flaky tests - https://dev.gitlab.org/gitlab/gitlabhq/-/pipelines/258125
- 16:35 UTC - RM engages development as the tests continue to fail
- 16:37 UTC - a known flaky test and its fix are cherry-picked into an MR targeting 15-5-stable-ee - gitlab-org/gitlab!102667 (merged)
- 18:06 UTC - that MR is failing tests because of a new flaky test - retries did not help
- 18:18 UTC - an MR to quarantine the second test is created against canonical master - gitlab-org/gitlab!102676 (merged)
- 19:13 UTC - the second MR is merged, then cherry-picked into the first MR
- 21:15 UTC - the first MR is merged - now we wait for 15-5-stable-ee and 15-5-stable to be green again
- 22:09 UTC - handover to the APAC RM
- 07:07 UTC next day - handover to the EMEA RM
- 07:25 UTC next day - RM discovers the 15-5-stable branch is still red with yet another, different failure - retried - https://dev.gitlab.org/gitlab/gitlabhq/-/pipelines/258187
- 08:13 UTC next day - branches are reported to be green
One may question how long it takes us to notice these failures and act on them. We currently have no active alerts that fire when a failure is detected, so unless individuals have modified their workflow to watch pipelines, it is difficult to identify these failures in a timely manner.
Potential Solution(s)
- During an active release cycle, consider how to notify the release manager when the active branches we must monitor change state (a minimal polling sketch follows this list)
- Consider creating a proactive capability to test stable branches ahead of time. As the timeline above shows, flaky tests surface even after a release has, at some point, been considered finalized. Running tests ahead of key checkpoints in our procedure would let us get ahead of failures and reduce stress as the release deadline nears
- Flaky tests are almost always related to UI testing, specifically waiting for target DOM elements. Work with QA to evaluate the usefulness of these tests and/or alternatives, with the intent of reducing the number of flaky tests that appear
- ...
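
A minimal sketch of what the notification idea could look like, assuming we poll the GitLab API for the latest pipeline on each active stable branch and post to a chat channel when one turns red. The instance URL, project path, branch list, environment variable names, and Slack webhook are placeholder assumptions for illustration, not a proposed design:

```python
# Sketch: poll the latest pipeline per stable branch and alert when one is red.
# Instance, project, branches, and the webhook are assumed placeholders.
import os
import time

import requests

GITLAB_API = "https://dev.gitlab.org/api/v4"     # assumed instance
PROJECT = "gitlab%2Fgitlabhq"                     # URL-encoded project path (example)
BRANCHES = ["15-5-stable", "15-5-stable-ee"]      # active release branches (example)
TOKEN = os.environ["GITLAB_TOKEN"]                # read-only API token (assumed env var)
SLACK_WEBHOOK = os.environ["SLACK_WEBHOOK_URL"]   # hypothetical incoming webhook


def latest_pipeline(branch):
    """Return the most recent pipeline for a branch, or None if there is none."""
    resp = requests.get(
        f"{GITLAB_API}/projects/{PROJECT}/pipelines",
        params={"ref": branch, "per_page": 1},
        headers={"PRIVATE-TOKEN": TOKEN},
        timeout=30,
    )
    resp.raise_for_status()
    pipelines = resp.json()
    return pipelines[0] if pipelines else None


def notify(branch, pipeline):
    """Post a short alert so the on-duty RM does not have to watch pipelines."""
    text = f":red_circle: {branch} is failing: {pipeline['web_url']}"
    requests.post(SLACK_WEBHOOK, json={"text": text}, timeout=30)


if __name__ == "__main__":
    alerted = {}  # branch -> last pipeline id we alerted on, to avoid repeats
    while True:
        for branch in BRANCHES:
            pipeline = latest_pipeline(branch)
            if pipeline and pipeline["status"] == "failed" and alerted.get(branch) != pipeline["id"]:
                notify(branch, pipeline)
                alerted[branch] = pipeline["id"]
        time.sleep(300)  # poll every five minutes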
Milestones
- Additional evidence and/or ideas are gathered
- Issues created to address them
- Retro this issue 1 year from now