Proposal: Introduce a "job duration analysis" section to master broken incidents for rspec jobs

Proposal

To utlizae automation and help our team troubleshoot the cause of job timeout errors, I am proposing a new section to the master broken incident issue if an rspec job fails.

This idea came from what I would do manually when I see a timed-out rspec job. I find the following steps helpful to get started with troubleshooting timeout issues:

look at the master knapsack report
calculate the expected duration of this job by listing the expected duration of each test file predicted by Knapsack
compare the duration prediction with the actual duration of the test files, if the job produced an artifact, using the Knapsack visualization tool documented in the runbook.

If a job is timed out and fails to produce an artifact, it can still be helpful to understand how long the job was predicted to take, so in this case, we list the expected duration of each test file without the actual durations.

This information can be beneficial in understanding the effectiveness of Knapsack parallelization, for example, gives us some clue as to whether there were too many tests being allocated to the job, causing the job to be overloaded. Similarly, it can also help us rule out factors by confirming that Knapsack did its jobs correctly, and the issue came from somewhere else.

See my steps in master-broken-incidents#4202 (comment 1652862762) as a small example of how I tried to gather this information manually.

The size of this new comment can be large, so I plan to utilize collapsible sections. It will look something like the following (still working on the detailed format)

Job Duration Analysis

link to master knapsack report (expected)

link to job knapsack report (actual)

all tests are expected take: total expected duration, and actually took actual duraton.

expected slowest spec: 'spec_file_1'
actual slowest spec: 'spec_file_1'

Expected test duration per file:

Click to expand

{ 'spec_file_1': 1234, 'spec_file_2' : 1234 ..}

actual spec duration per file:

Click to expand

{ 'spec_file_1': 1234, 'spec_file_2' : 1234 ..}

I'm thinking of adding this into a comment instead of putting it in the body of the issue, in case it causes too much clutter. One thought I had was to replace the root cause analysis section with this new duration analysis since the RCA thread is left empty in most of the incidents that I see.

@gl-quality/eng-prod please let me know what you think, thanks!

Action Items (Updated on Dec 21)

Thanks to team's feedback, I have gathered the following todo lists from this proposal:

Add expected and actual spec file duration data in master-broken incident issues.
Print the actual spec run time in the job log, this involves achieving some or all of:
1. actual knapsack report file at ~85 min mark
2. file runtime info and total time up to now after each spec file
3. (Stretch) runtime info after each test case
[ ] (Abandoned) Raise an alert (we can start with Slack alerts) if a master pipeline (non-scheduled) runs a test that is not found in the master-report and was not added in the latest commit. If the alert is deemed useful, consider failing the pipeline when this happens.

Next iteration

Create issue using https://gitlab.com/gitlab-org/ruby/gems/gitlab_quality-test_tooling when a specific spec is causing the job to significantly slow down or time out (spec taking more than 40% of duration, and is significantly slower than expected)

Stretch

Create issue for non-rspec job timeouts via https://gitlab.com/gitlab-org/ruby/gems/gitlab_quality-test_tooling
Create issue when Knapsack scheduling contributed to job timeouts (when none of the tests are running significantly longer than expected, but job still timed-out)

Edited Dec 21, 2023 by Jennifer Li