Create better alerting when test failures indicate system failures

Problem

From Identify root causes of infrastructure issues (#3850 - closed), we concluded that the spikes in test failures that led to deployment blockages happened during incidents. When such an incident happens, we should clearly communicate to release managers that these test failures are due to environment instability, not the tests themselves. This saves release managers the time spent going through pipeline DRIs for failure analysis, so they can raise an incident right away.

Proposal

We can leverage the existing mechanism that groups similar failures into one issue (example issue: https://gitlab.com/gitlab-org/quality/e2e-test-issues/-/issues/1231+) by posting that issue to the corresponding pipeline thread in the #announcements Slack channel.
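
As a rough illustration of the proposal, the sketch below builds a threaded Slack reply that links the grouped failure issue. It assumes the reply is sent through Slack's `chat.postMessage` Web API method with a bot token. The helper names, the channel ID, the thread timestamp, and the "three distinct pipelines" threshold are all hypothetical placeholders, not part of any existing tooling.

```python
import json
import urllib.request

SLACK_POST_URL = "https://slack.com/api/chat.postMessage"


def is_widespread(failing_pipelines, min_pipelines=3):
    """Hypothetical heuristic: treat a grouped failure as widespread when it
    appears in at least `min_pipelines` distinct pipelines."""
    return len(set(failing_pipelines)) >= min_pipelines


def build_thread_reply(channel, thread_ts, issue_url, failing_pipelines):
    """Build a chat.postMessage payload that replies in an existing thread."""
    count = len(set(failing_pipelines))
    text = (
        f"Similar test failures were grouped across {count} pipelines: {issue_url}\n"
        "This may indicate environment instability rather than test problems. "
        "Consider raising an incident."
    )
    # thread_ts makes the message a reply in the pipeline's existing thread.
    return {"channel": channel, "thread_ts": thread_ts, "text": text}


def post_to_slack(token, payload):
    """Send the payload to Slack; requires a bot token with chat:write scope."""
    request = urllib.request.Request(
        SLACK_POST_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json; charset=utf-8",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


if __name__ == "__main__":
    # Placeholder values for illustration only.
    pipelines = ["pipeline-a", "pipeline-b", "pipeline-c"]
    if is_widespread(pipelines):
        payload = build_thread_reply(
            channel="C0123456789",
            thread_ts="1700000000.000100",
            issue_url="https://gitlab.example.com/group/project/-/issues/1",
            failing_pipelines=pipelines,
        )
        print(json.dumps(payload, indent=2))  # call post_to_slack(token, payload) to send
```

Keeping the payload construction separate from the network call makes the alert wording and the threshold easy to test without touching Slack.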

Outcome

  • Help pipeline DRIs and Delivery quickly identify that there is a more widespread issue in the environment or across pipelines, rather than one or more flaky tests or a one-off pipeline flaking out.
  • Improve release managers' workflow and time to react to these types of widespread failures.
  • Better categorize this type of failure to help improve deployment blockage reporting.
Edited by Tiffany Rea