(2023-12-13) Job timeout investigation
Context
We have more job timeouts than usual, and we'd like to know why.
We started investigating in gitlab-org/quality/engineering-productivity/team#323 (closed). To keep the scope of that issue from growing, we moved the ongoing work into a separate epic.
Goal
- Understand why CI/CD jobs are timing out
How to investigate?
Here's my process to date:
- Get some jobs that took more than 80 minutes:
# ci_job_timeouts comes from
# https://gitlab.com/splattael/gitlab-tools/-/blob/master/bin/ci_job_timeouts.rb
#
# I copied it into my /usr/local/bin folder
ci_job_timeouts gitlab-org/gitlab "rspec.*" 80 > timeouts.txt
- Pick the first 20 jobs, and have a look at their output
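Picking the first 20 jobs can be scripted too. This is only a sketch: it assumes timeouts.txt holds one job per line, which the process above does not confirm, and first_jobs is a hypothetical helper name.

```shell
# Hypothetical helper: print the first N lines of a file (default: 20).
# Assumption: the file lists one job per line.
first_jobs() {
  head -n "${2:-20}" "$1"
}

# Usage: first_jobs timeouts.txt
```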
What to look for in the output?
Here's what I do there to get an idea of the timing pattern:
- Copy the raw output into your favorite editor
- Search for "took", and add a cursor on each matching line
- Copy all the lines containing "took" to another file, and analyse what you see
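If you'd rather skip the editor, the same filtering can probably be done from the command line. A minimal sketch, assuming the raw job output was saved locally; job.log and took.txt are placeholder file names and took_lines is a hypothetical helper.

```shell
# Hypothetical helper: print only the lines of a raw job log that contain "took".
# -F treats "took" as a fixed string rather than a regular expression.
took_lines() {
  grep -F 'took' "$1"
}

# Usage: took_lines job.log > took.txt
```

The match is case-sensitive and keeps the original line order, so took.txt can be read top to bottom to follow the job's timeline.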
Edited by David Dieulivol