Improve systemic error detection by looking at the first backtrace line
What does this MR do and why?
We now group errors by the first line of their backtrace, which lets us better detect systemic errors, i.e. errors that start happening consistently after a certain point, most probably because of a runner environment issue (e.g. resources are exhausted, PG doesn't have enough memory, etc.).
Note: There's a risk that "legit" failure messages that are a bit generic (e.g. Failure/Error: expect(job).to be_successful(timeout: 400)) might be detected as systemic with this change. I'll open an MR to make the systemic error detection threshold customizable.
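To illustrate the grouping described above, here is a minimal sketch, not the code from this MR. The `Failure` struct, the method names, and the `SYSTEMIC_THRESHOLD` value are all hypothetical and only serve to show the idea: failures that share the same first backtrace line are counted together, and a group is treated as systemic once it reaches a threshold.

```ruby
# Hypothetical failure record: an error message plus its backtrace lines.
Failure = Struct.new(:message, :backtrace)

# Hypothetical threshold, chosen for illustration only.
SYSTEMIC_THRESHOLD = 3

# Group failures by the first line of their backtrace, falling back to the
# message when the backtrace is missing or empty.
def group_by_first_backtrace_line(failures)
  failures.group_by { |failure| failure.backtrace&.first || failure.message }
end

# Keep only the groups large enough to be considered systemic.
def systemic_groups(failures, threshold: SYSTEMIC_THRESHOLD)
  group_by_first_backtrace_line(failures)
    .select { |_first_line, group| group.size >= threshold }
end
```

With this shape, a generic but legitimate expectation failure raised from the same line in many specs would also form a large group, which is why a configurable threshold would help.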
MR acceptance checklist
This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.
- I have evaluated the MR acceptance checklist for this MR.
Edited by Rémy Coutable