Detect timeout problems even when RSpec process did finish successfully (!2934) · Merge requests · GitLab.org / Quality Department / triage-ops

Rémy Coutable requested to merge detect-timeout-issues-even-when-job-ran-rspec into master Jul 10, 2024

What does this MR do and why?

There are cases where a job times out after the RSpec process finished successfully, e.g. https://gitlab.com/gitlab-org/gitlab/-/jobs/7307042110.

This MR ensures that transient problems are detected on the trace excluding the "body section" (i.e. the RSpec results, RuboCop, Workhorse sections, etc.; naming is hard!).
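As a rough illustration of excluding the "body section", here is a minimal sketch. It assumes the body is delimited by GitLab's collapsible-section markers (`section_start:<timestamp>:<name>` / `section_end:<timestamp>:<name>`). The section names and the `trace_without_body` helper are hypothetical, and the actual implementation in this MR may delimit the body differently.

```ruby
# Hypothetical section names; the real list used by triage-ops may differ.
BODY_SECTIONS = %w[rspec rubocop workhorse].freeze

# Returns the trace with every listed section removed, markers included.
# Assumes sections are not nested inside each other.
def trace_without_body(trace, sections = BODY_SECTIONS)
  sections.reduce(trace) do |text, name|
    escaped = Regexp.escape(name)
    # /m lets "." span newlines; the lookahead avoids matching longer names
    # that merely start with `name` (e.g. "rspec-retry").
    pattern = /section_start:\d+:#{escaped}(?![\w.-]).*?section_end:\d+:#{escaped}(?![\w.-])[^\n]*\n?/m
    text.gsub(pattern, "")
  end
end
```

With the body removed, a string such as 500 Internal Server Error printed inside the RSpec output can no longer be matched, while a trailing ERROR: Job failed: execution took longer than line is still visible to the detection.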

It also changes the order in which transient problems are detected, to reduce the likelihood of matching 500 Internal Server Error (which could be present in a test description) before ERROR: Job failed: execution took longer than.

Note that we could still, in theory, wrongly detect a master-broken::infrastructure problem (because 500 Internal Server Error is present in a test description), but given that 500 Internal Server Error appears in only a single test in the whole test suite, I think detecting timed-out jobs is more important than handling the edge case caused by a single test.
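The ordering change and the remaining edge case can be sketched as follows. This is only an illustration: the `TRANSIENT_PROBLEMS` constant, the problem names (`:job_timeout`, `:infrastructure`) and the `detect_transient_problem` helper are hypothetical and are not taken from the triage-ops code. Only the two matched strings come from the description above.

```ruby
# Ordered list: the first matching pattern wins, so the specific timeout
# message is checked before the more generic 500 error.
TRANSIENT_PROBLEMS = [
  [:job_timeout, /ERROR: Job failed: execution took longer than/],
  [:infrastructure, /500 Internal Server Error/]
].freeze

# Returns the name of the first problem whose pattern matches, or nil.
def detect_transient_problem(trace)
  match = TRANSIENT_PROBLEMS.find { |_name, pattern| trace.match?(pattern) }
  match&.first
end

# A timed-out job whose trace also contains the 500 string.
timed_out = "it returns 500 Internal Server Error\n" \
            "ERROR: Job failed: execution took longer than 1h30m0s seconds"
detect_transient_problem(timed_out) # => :job_timeout

# The remaining edge case: only the test description matches.
detect_transient_problem("it returns 500 Internal Server Error") # => :infrastructure
```

With the two entries in the opposite order, the first call would return `:infrastructure` instead, which is the misclassification the reordering is meant to make less likely.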

Expected impact & dry-runs

These are strongly recommended to assist reviewers and reduce the time to merge your change.

See https://gitlab.com/gitlab-org/quality/triage-ops/-/tree/master/doc/scheduled#testing-policies-with-a-dry-run on how to perform dry-runs for new policies.

See https://gitlab.com/gitlab-org/quality/triage-ops/-/blob/master/doc/reactive/best_practices.md#use-the-sandbox-to-test-new-processors on how to make sure a new processor can be tested.

Action items

- If adding environment variables for reactive processors, update config/triage-web.yaml and .gitlab/ci/triage-web.yml
- (If applicable) Add documentation to the handbook pages for Triage Operations =>
- (If applicable) Identify the affected groups and how to communicate to them:
  - /cc @person_or_group =>
  - Relevant Slack channels =>
  - Engineering week-in-review

Edited Jul 10, 2024 by Rémy Coutable
