Various pipeline testing jobs are timing out
Problem
Over the last week, a number of pipelines have had one or more jobs fail. This seems to be happening more frequently, but I'm not aware of a way to capture these errors in Sentry or Kibana. To try to quantify this, I identified the last 100 failing pipelines against master and sifted through the failing job traces to see which ones contained "Job failed: execution took longer than".
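The trace scan described above can be sketched roughly as follows. This is a minimal illustration, not the script that was actually used: the function names, the `failed_jobs` input shape (pipeline ID mapped to a list of job name and trace text pairs), and the `exclude_jobs` parameter are all assumptions made for the example. It also assumes the traces have already been downloaded, and that a pipeline counts as a timeout failure if at least one non-excluded job trace contains the marker string.

```python
from typing import Dict, Iterable, List, Tuple

# Marker string searched for in the failing job traces (taken from the issue text).
TIMEOUT_MARKER = "Job failed: execution took longer than"


def job_timed_out(trace: str) -> bool:
    """Return True if the job trace contains the timeout marker."""
    return TIMEOUT_MARKER in trace


def pipelines_with_timeouts(
    failed_jobs: Dict[int, List[Tuple[str, str]]],
    exclude_jobs: Iterable[str] = (),
) -> List[int]:
    """Return the sorted IDs of pipelines with at least one timed-out job.

    failed_jobs maps a pipeline ID to (job_name, trace_text) pairs for its
    failed jobs (hypothetical shape, chosen for this sketch). Jobs whose name
    is in exclude_jobs are ignored, e.g. "schedule:package-and-qa".
    """
    excluded = set(exclude_jobs)
    matching = []
    for pipeline_id, jobs in failed_jobs.items():
        if any(
            name not in excluded and job_timed_out(trace)
            for name, trace in jobs
        ):
            matching.append(pipeline_id)
    return sorted(matching)
```

Calling `pipelines_with_timeouts(failed_jobs)` would give the overall count of affected pipelines, and passing `exclude_jobs={"schedule:package-and-qa"}` would give the count with those failures removed. Jobs that got stuck without producing a trace would not be detected by this approach.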
Results of last 100 failed master pipelines as of 2019-12-07 02:06 UTC
21 of the last 100 failed master pipelines had a job fail due to a timeout. Removing the schedule:package-and-qa failures leaves the 6 pipelines listed below. This count does not include pipelines with jobs that failed after getting stuck without a job trace, such as https://gitlab.com/gitlab-org/gitlab/-/jobs/369736296.
- Pipeline 101298263 at 2019-12-06T20:14:57.325Z
- Pipeline 101199368 at 2019-12-06T13:34:12.424Z
- Pipeline 101110839 at 2019-12-06T07:45:37.042Z
- Pipeline 100806780 at 2019-12-05T06:32:32.808Z
- Pipeline 100687068 at 2019-12-04T17:42:58.732Z
- Pipeline 100541391 at 2019-12-04T09:50:35.693Z
I've been seeing this more often and I'm not sure what's going on. It does not seem to correlate with any system outage that I can find.