www-gitlab-com jobs stalling and not running
Overview
From discussions in Slack (https://gitlab.slack.com/archives/C101F3796/p1594074319135800):
Some CI jobs in the www-gitlab-com project are stalling at startup: they never actually run, and they time out after nearly 2 hours, stalling the merge train and generally being annoying.
An example: https://gitlab.com/gitlab-com/www-gitlab-com/-/jobs/626293506
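For context, other jobs in the same state can be enumerated with the public Jobs API. A minimal sketch, assuming a personal access token with `read_api` scope in a hypothetical `GITLAB_TOKEN` environment variable, and an arbitrary 90-minute threshold just under the ~2 hour timeout:

```python
# List "running" jobs in www-gitlab-com that started more than
# 90 minutes ago, via the GitLab Jobs API.
import os
from datetime import datetime, timezone

import requests

API = "https://gitlab.com/api/v4"
PROJECT = "gitlab-com%2Fwww-gitlab-com"  # URL-encoded project path

resp = requests.get(
    f"{API}/projects/{PROJECT}/jobs",
    params={"scope[]": "running", "per_page": 100},
    headers={"PRIVATE-TOKEN": os.environ["GITLAB_TOKEN"]},
)
resp.raise_for_status()

now = datetime.now(timezone.utc)
for job in resp.json():
    if not job.get("started_at"):
        continue
    started = datetime.fromisoformat(job["started_at"].replace("Z", "+00:00"))
    age_min = (now - started).total_seconds() / 60
    if age_min > 90:
        print(f'{job["id"]}\t{age_min:.0f}m\t{job["web_url"]}')
```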
This job ID does not show up in any logs, and yet private-runners-manager-4.gitlab.com has clearly claimed it in some fashion. The gitlab-runner process on that machine was restarted (possibly upgraded, although the binary is from 2020-07-01) on 2020-07-06 by @steveazz.
The only recent evidence of anything particularly unusual I can find is that since 2020-07-06 14:33 UTC the runner managers have been seeing DeadlineExceeded errors on calls to /api/v4/jobs/request: https://log.gprd.gitlab.net/goto/0b79912bac70f4b393f06df1cb3eb4d9. But those errors are on requests for new jobs; leaving aside why that endpoint calls into Gitaly (looking for tags) and times out, it doesn't seem directly related to this issue, particularly as the runner does successfully log "Checking for jobs... failed" when it happens.
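For what it's worth, the same poll the runner makes can be issued by hand to see whether the endpoint is slow or erroring from a given vantage point. A minimal sketch, assuming the runner's authentication token in a hypothetical `RUNNER_TOKEN` environment variable; the request body here is deliberately stripped down compared to what gitlab-runner actually sends:

```python
# Reproduce the runner's job-request poll against /api/v4/jobs/request
# and report the status code and elapsed time.
import os
import time

import requests

URL = "https://gitlab.com/api/v4/jobs/request"

start = time.monotonic()
resp = requests.post(
    URL,
    json={"token": os.environ["RUNNER_TOKEN"]},
    timeout=60,
)
elapsed = time.monotonic() - start

# 201 means a job was assigned, 204 means no job was available;
# anything else (e.g. a 500) is worth correlating with the Rails logs.
# NOTE: a 201 response claims a real job for this token, so only run
# this against a runner that is safe to experiment with.
print(f"status={resp.status_code} elapsed={elapsed:.1f}s")
```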
I'm out of good ideas (and only have bad ones, like using strace and hoping to get lucky). It's a relatively low rate of errors in a firehose of things that are working fine. I've asked in #g_runner in Slack; maybe we'll get something from there.
It does not, as far as I've been able to tell, affect any other projects. Or at least: no-one is reporting it, and I've not seen anything where I've looked.
Update 2020-07-07 16:15 UTC
- In https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/10749#note_374814509 we confirmed that Rails assigns a runner to the job.
- When the runner asks for the job, it gets a 500 error, as shown in https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/10749#note_374941812
- We end up in a state where Rails thinks the job is running, but the runner was never given a job (see the sketch after this list).
- The job is either canceled by the user or cleaned up by `StuckCiJobsWorker`, which cleans up stale jobs every hour.
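Given that sequence, affected jobs should look like "running" jobs with a completely empty trace, since no runner ever actually executed them. A minimal sketch of that check (an empty trace is only a heuristic: a legitimately running job may simply not have produced output yet), reusing the assumed `GITLAB_TOKEN` from above:

```python
# Flag "running" jobs in www-gitlab-com whose trace is empty,
# i.e. jobs Rails believes are running but no runner ever picked up.
import os

import requests

API = "https://gitlab.com/api/v4"
PROJECT = "gitlab-com%2Fwww-gitlab-com"
HEADERS = {"PRIVATE-TOKEN": os.environ["GITLAB_TOKEN"]}

jobs = requests.get(
    f"{API}/projects/{PROJECT}/jobs",
    params={"scope[]": "running", "per_page": 100},
    headers=HEADERS,
)
jobs.raise_for_status()

for job in jobs.json():
    trace = requests.get(
        f"{API}/projects/{PROJECT}/jobs/{job['id']}/trace",
        headers=HEADERS,
    )
    if not trace.text.strip():
        print(f'suspect: {job["id"]} {job["web_url"]}')
```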