Detect timeout problems even when RSpec process did finish successfully

What does this MR do and why?

There are cases where a job times out after the RSpec process finished successfully, e.g. https://gitlab.com/gitlab-org/gitlab/-/jobs/7307042110.

This MR ensures that transient problems are detected on the trace with the "body section" excluded (i.e. the RSpec results, RuboCop, Workhorse sections, etc.; naming is hard!).
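As a rough illustration of the idea (not the code from this MR), the sketch below removes one delimited section from a trace before any pattern matching happens. The method name `trace_without_body` and the `<<body>>` / `<<end>>` markers are hypothetical; the real trace layout and section delimiters may differ.

```ruby
# Minimal sketch: drop the "body section" from a job trace.
# `body_start` and `body_end` are hypothetical marker patterns supplied by the caller.
def trace_without_body(trace, body_start:, body_end:)
  start_idx = trace.index(body_start)
  return trace if start_idx.nil? # no body section found: keep the whole trace

  end_match = body_end.match(trace, start_idx)
  return trace if end_match.nil? # unterminated body: stay conservative, keep everything

  # Keep what precedes the body and what follows its end marker.
  trace[0...start_idx] + trace[end_match.end(0)..]
end

trace = <<~TRACE
  setup
  <<body>>
  it returns 500 Internal Server Error
  <<end>>
  ERROR: Job failed: execution took longer than 1h30m0s seconds
TRACE

puts trace_without_body(trace, body_start: /^<<body>>\n/, body_end: /^<<end>>\n/)
```

With the body removed, a test description that happens to mention an error string can no longer be mistaken for a real failure message.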

It also changes the order in which transient problems are detected, to reduce the likelihood of matching 500 Internal Server Error (which could be present in a test description) before ERROR: Job failed: execution took longer than.

Note that, in theory, we could still wrongly detect a master-broken::infrastructure problem (because 500 Internal Server Error is present in a test description). However, given that 500 Internal Server Error appears in only a single test in the whole test suite, I think detecting timed-out jobs is more important than handling the edge case caused by that one test.
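The ordering concern can be sketched as a first-match-wins lookup. This is only an illustration under assumptions: the constant `TRANSIENT_PROBLEMS`, the method `detect_transient_problem`, and the problem names `:job_timeout` and `:infrastructure` are hypothetical, and only the two message strings come from this description.

```ruby
# Minimal sketch: patterns are checked in order, so the timeout message
# is tested before the more ambiguous "500 Internal Server Error".
# Names below are hypothetical, for illustration only.
TRANSIENT_PROBLEMS = [
  [:job_timeout, /ERROR: Job failed: execution took longer than/],
  [:infrastructure, /500 Internal Server Error/]
].freeze

def detect_transient_problem(trace)
  TRANSIENT_PROBLEMS.each do |name, pattern|
    return name if trace.match?(pattern)
  end
  nil # no known transient problem found
end

trace = "it returns 500 Internal Server Error\n" \
        "ERROR: Job failed: execution took longer than 1h30m0s seconds\n"
p detect_transient_problem(trace)
```

Because the timeout pattern comes first, a trace containing both strings is classified as a timeout rather than an infrastructure problem.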

Related to #1493 (closed), gitlab-org/quality/engineering-productivity/master-broken-incidents#7303 (closed).

Expected impact & dry-runs

These are strongly recommended to assist reviewers and reduce the time to merge your change.

See https://gitlab.com/gitlab-org/quality/triage-ops/-/tree/master/doc/scheduled#testing-policies-with-a-dry-run on how to perform dry-runs for new policies.

See https://gitlab.com/gitlab-org/quality/triage-ops/-/blob/master/doc/reactive/best_practices.md#use-the-sandbox-to-test-new-processors on how to make sure a new processor can be tested.

Action items

Edited by Rémy Coutable