Review pipeline triage responsibilities and processes to reduce the number of open ~failure::investigating issues
Summary
There are currently 1042 open failure issues with the ~"failure::investigating" label.
I suspect the majority are caused by transient errors such as Net::ReadTimeout that don't reflect feature or test bugs, so those issues can be closed. That has been true of many that I've closed recently.
When a test fails because of an application bug, the test usually keeps failing and is picked up by the DRIs on pipeline triage, so the open ~"failure::investigating" issues don't reflect the quality of the code. Instead, the transient failures create unnecessary noise and effort, including for engineering managers.
Proposal
I believe we can solve the problem through improvements to the reporting automation. In the meantime, however, we might need to dedicate some time and effort to clearing the existing backlog, and to adjusting the pipeline triage process and responsibilities to keep the number of issues under control.
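As a starting point for discussion, here is a minimal sketch of how the automation might separate likely-transient failures from ones that need investigation. It is an illustration only and does not describe the existing reporting tooling: the `FailureIssue` shape, the `partition` helper, and the pattern list are all hypothetical, and the only pattern taken from this issue is Net::ReadTimeout.

```python
import re
from dataclasses import dataclass

# Hypothetical list of error signatures treated as transient.
# Only Net::ReadTimeout is mentioned in this issue; any others would
# need to be agreed on by the team.
TRANSIENT_PATTERNS = [
    re.compile(r"Net::ReadTimeout"),
]


@dataclass
class FailureIssue:
    """Hypothetical minimal view of a failure issue."""
    iid: int
    stack_trace: str


def is_transient(issue: FailureIssue) -> bool:
    """Return True if the stack trace matches any transient signature."""
    return any(p.search(issue.stack_trace) for p in TRANSIENT_PATTERNS)


def partition(issues):
    """Split issues into (likely transient, needs investigation)."""
    transient = [i for i in issues if is_transient(i)]
    needs_investigation = [i for i in issues if not is_transient(i)]
    return transient, needs_investigation
```

Issues in the first group could be candidates for automatic closing or a different label, while the second group stays with the DRIs on pipeline triage. Whether matching on the stack trace is reliable enough is an open question.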
@gl-quality Thoughts?
Links
- Test execution dashboards: https://dashboards.quality.gitlab.net/
- Pipeline triage guidelines: https://about.gitlab.com/handbook/engineering/quality/quality-engineering/oncall-rotation
- Test debugging guidelines: https://about.gitlab.com/handbook/engineering/quality/quality-engineering/debugging-qa-test-failures/