Runner randomly getting stuck in "preparing" state
Summary
Runner sometimes gets stuck in a "preparing" state and hangs until manually canceled. This happens with the docker+machine executor on Runner version v16.8.1, and it also happens on v16.5.0. It occurs when a very large number of jobs (500-1000) are spun up simultaneously, and the problem only starts after a large number of those jobs (200-400) have completed.
Steps to reproduce
- Set up the docker+machine executor on Runner version v16.8.1
- Use the docker+machine executor to spin up 1000 VMs that take jobs simultaneously for a single pipeline
- Observe that Runner sometimes becomes stuck in a "preparing" state after completing a large number of those jobs (200-400)
Example Project
What is the current bug behavior?
Runner sometimes gets stuck in a "preparing" state and stays there until manually canceled.
What is the expected correct behavior?
Runner should cancel stuck jobs without manual intervention.
Relevant logs and/or screenshots
The logs show the following sequence repeating:
Mar 19 16:27:11 wafrunner-research-xyltgp-vm gitlab-runner[501730]: Updating job... bytesize=378 checksum=crc32:d85955d2 job=82787261 runner=QEzytxJx
Mar 19 16:27:11 wafrunner-research-xyltgp-vm gitlab-runner[501730]: Submitting job to coordinator...ok bytesize=378 checksum=crc32:d85955d2 code=200 job=82787261 job-status= runner=QEzytxJx update-interval=0s
Mar 19 16:28:11 wafrunner-research-xyltgp-vm gitlab-runner[501730]: Updating job... bytesize=378 checksum=crc32:d85955d2 job=82787261 runner=QEzytxJx
Mar 19 16:28:11 wafrunner-research-xyltgp-vm gitlab-runner[501730]: Sleeping due to rate limit context=ratelimit-requester-gitlab-request duration=31m48.733575858s method=PUT url=https://gitlab.f5net.com/api/v4/jobs/82787261
Mar 19 17:00:00 wafrunner-research-xyltgp-vm gitlab-runner[501730]: WARNING: Submitting job to coordinator... failed bytesize=378 checksum=crc32:d85955d2 code=-1 job=82787261 job-status= runner=QEzytxJx status=couldn't execute PUT against https://gitlab.f5net.com/api/v4/jobs/82787261: Put "https://gitlab.f5net.com/api/v4/jobs/82787261": http: ContentLength=888 with Body length 0 update-interval=0s
Mar 19 17:01:00 wafrunner-research-xyltgp-vm gitlab-runner[501730]: Updating job... bytesize=378 checksum=crc32:d85955d2 job=82787261 runner=QEzytxJx
Mar 19 17:01:00 wafrunner-research-xyltgp-vm gitlab-runner[501730]: Submitting job to coordinator...ok bytesize=378 checksum=crc32:d85955d2 code=200 job=82787261 job-status= runner=QEzytxJx update-interval=0s
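One way to check how the rate-limit sleep relates to the failed PUT is to compute when each sleep ends and compare that with the next log timestamp. The following is a minimal sketch, assuming the syslog-style line format shown above. The helper names `parse_go_duration` and `rate_limit_wakeups` are made up for this illustration, and the year is a placeholder because the log lines do not include one.

```python
import re
from datetime import datetime, timedelta

# Seconds per unit for the Go-style durations the runner prints,
# e.g. "31m48.733575858s".
UNIT_SECONDS = {
    "h": 3600.0, "m": 60.0, "s": 1.0,
    "ms": 1e-3, "us": 1e-6, "µs": 1e-6, "ns": 1e-9,
}


def parse_go_duration(text):
    """Convert a duration such as '31m48.733575858s' into a timedelta."""
    # "ms", "us", "µs" and "ns" are listed before "m" and "s" so they match first.
    parts = re.findall(r"([\d.]+)(h|ms|us|µs|ns|m|s)", text)
    total = sum(float(value) * UNIT_SECONDS[unit] for value, unit in parts)
    return timedelta(seconds=total)


def rate_limit_wakeups(lines, year=2024):
    """Yield (sleep_start, wake_time) for each 'Sleeping due to rate limit' line.

    `year` is a placeholder assumption: syslog timestamps omit the year.
    """
    for line in lines:
        if "Sleeping due to rate limit" not in line:
            continue
        match = re.search(r"duration=(\S+)", line)
        if not match:
            continue
        # The first 15 characters hold the timestamp, e.g. "Mar 19 16:28:11".
        start = datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")
        yield start, start + parse_go_duration(match.group(1))


# Line copied from the excerpt above.
sample = (
    "Mar 19 16:28:11 wafrunner-research-xyltgp-vm gitlab-runner[501730]: "
    "Sleeping due to rate limit context=ratelimit-requester-gitlab-request "
    "duration=31m48.733575858s method=PUT "
    "url=https://gitlab.f5net.com/api/v4/jobs/82787261"
)

for start, wake in rate_limit_wakeups([sample]):
    print("sleep started:", start.time(), "sleep ends:", wake.time())
```

For the excerpt above, the computed wake-up time falls within a second before 17:00:00, which is the timestamp of the failed PUT. That suggests the failing request is the one that was held back by the rate limiter. This is an observation from the timestamps only, not a confirmed root cause.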
Output of checks
Results of GitLab environment info
(For installations with omnibus-gitlab package run and paste the output of: `sudo gitlab-rake gitlab:env:info`) (For installations from source run and paste the output of: `sudo -u git -H bundle exec rake gitlab:env:info RAILS_ENV=production`)
Results of GitLab application Check
(For installations with omnibus-gitlab package run and paste the output of: `sudo gitlab-rake gitlab:check SANITIZE=true`) (For installations from source run and paste the output of: `sudo -u git -H bundle exec rake gitlab:check RAILS_ENV=production SANITIZE=true`) (we will only investigate if the tests are passing)