StuckCiJobsWorker wrongly detects, cancels 'stuck' builds when per-job timeout is more than an hour
Summary
When builds are allowed to take more than 60 minutes, builders on long builds can be marked as offline when they're actually online; and waiting builds can be cancelled as stuck when they're not stuck.
Steps to reproduce
- Get a Gitlab CE instance with the default configuration. I did it like so, on an ec2 m3.medium with ami-7842030f:
sudo apt-get update; curl -sS https://packages.gitlab.com/install/repositories/gitlab/gitlab-ce/script.deb.sh | sudo bash; sudo apt-get install -y gitlab-ce; curl -sSL https://get.docker.com/ | sh; curl -L https://packages.gitlab.com/install/repositories/runner/gitlab-ci-multi-runner/script.deb.sh | sudo bash; sudo apt-get -y install gitlab-ci-multi-runner; sudo apt-get upgrade -y;
sudo gitlab-ctl reconfigure
sudo gitlab-runner register
- Make a project with a single, dedicated runner (and no shared runners).
- Set the per-job timeout to 90 minutes. Make the builds take 85 minutes. Deselect 'Auto-cancel redundant, pending pipelines' or create two branches so you can start two builds.
- Start two builds at 10:50
- The second build will be cancelled at 12:00.
Example Project
I attempted to reproduce this on gitlab.com but was unsuccessful; I had to install my own Gitlab CE instance.
My project had a single file created with http://54.170.105.106/root/gitlab-ci-test/new/master?commit_message=Add+.gitlab-ci.yml&file_name=.gitlab-ci.yml
image: ubuntu:16.04
build:
stage: build
script:
- echo "Hello world"
- sleep 85m
- echo "Done sleeping."
What is the current bug behavior?
- Two builds are queued at 10:50; one is picked up by the runner, the other waits.
- At 11:50, the runner is marked
Runner is offline, last contact was about 1 hour ago
- At 12:00, the second build is cancelled as it's stuck.
- At 12:15, the first build completes and the runner is marked online again.
What is the expected correct behavior?
- Two builds are queued at 10:50; one is picked up by the runner, the other waits.
- At 12:15, the first build completes and the second build starts.
- At 13:40, the second build completes.
Relevant logs and/or screenshots
Jobs 2 and 3 being cancelled, on the hour, while job 1 is running:
Started PATCH "/api/v4/jobs/1/trace" for 127.0.0.1 at 2017-09-22 17:59:57 +0000
StuckCiJobsWorker: Cleaning stuck builds
StuckCiJobsWorker: Dropping stuck build 2 for runner (status: pending, timeout: 3600)
StuckCiJobsWorker: Dropping stuck build 3 for runner (status: pending, timeout: 3600)
<snip mailing and metrics>
Started PATCH "/api/v4/jobs/1/trace" for 127.0.0.1 at 2017-09-22 18:00:27 +0000
Gitlab API project jobs:
[
{
"id": 3,
"status": "failed",
"stage": "build",
"name": "build",
"ref": "master",
"tag": false,
"coverage": null,
"created_at": "2017-09-22T16:56:47.658Z",
"started_at": null,
"finished_at": "2017-09-22T18:00:09.595Z",
"user": <snip>
"commit": <snip>
"runner": null,
"pipeline": {
"id": 3,
"sha": "19adf1b443ff9fcb9ca6ce514d64ba48aa4c2d89",
"ref": "master",
"status": "failed"
}
},
{
"id": 2,
"status": "failed",
"stage": "build",
"name": "build",
"ref": "master",
"tag": false,
"coverage": null,
"created_at": "2017-09-22T16:56:36.026Z",
"started_at": null,
"finished_at": "2017-09-22T18:00:09.506Z",
"user": <snip>
"commit": <snip>
"runner": null,
"pipeline": {
"id": 2,
"sha": "19adf1b443ff9fcb9ca6ce514d64ba48aa4c2d89",
"ref": "master",
"status": "failed"
}
},
{
"id": 1,
"status": "running",
"stage": "build",
"name": "build",
"ref": "master",
"tag": false,
"coverage": null,
"created_at": "2017-09-22T16:56:23.660Z",
"started_at": "2017-09-22T16:56:24.923Z",
"finished_at": null,
"user": <snip>
"commit": <snip>
"runner": {
"id": 1,
"description": "ip-10-98-181-67",
"active": true,
"is_shared": false,
"name": "gitlab-ci-multi-runner"
},
"pipeline": {
"id": 1,
"sha": "19adf1b443ff9fcb9ca6ce514d64ba48aa4c2d89",
"ref": "master",
"status": "running"
}
}
]
Output of checks
(If you are reporting a bug on GitLab.com, write: This bug happens on GitLab.com)
Results of GitLab environment info
Expand for output related to GitLab environment info
ubuntu@ip-10-98-181-67:/var/log/gitlab$ sudo gitlab-rake gitlab:env:infoSystem information System: Ubuntu 14.04 Current User: git Using RVM: no Ruby Version: 2.3.5p376 Gem Version: 2.6.13 Bundler Version:1.13.7 Rake Version: 12.0.0 Redis Version: 3.2.5 Git Version: 2.13.5 Sidekiq Version:5.0.4 Go Version: unknown
GitLab information Version: 10.0.0 Revision: bbf5c73 Directory: /opt/gitlab/embedded/service/gitlab-rails DB Adapter: postgresql URL: http://ec2-54-170-105-106.eu-west-1.compute.amazonaws.com HTTP Clone URL: http://ec2-54-170-105-106.eu-west-1.compute.amazonaws.com/some-group/some-project.git SSH Clone URL: git@ec2-54-170-105-106.eu-west-1.compute.amazonaws.com:some-group/some-project.git Using LDAP: no Using Omniauth: no
GitLab Shell Version: 5.9.0 Repository storage paths:
- default: /var/opt/gitlab/git-data/repositories Hooks: /opt/gitlab/embedded/service/gitlab-shell/hooks Git: /opt/gitlab/embedded/bin/git
Results of GitLab application Check
Expand for output related to the GitLab application check
ubuntu@ip-10-98-181-67:/var/log/gitlab$ sudo gitlab-rake gitlab:check SANITIZE=true Checking GitLab Shell ...GitLab Shell version >= 5.9.0 ? ... OK (5.9.0) Repo base directory exists? default... yes Repo storage directories are symlinks? default... no Repo paths owned by git:root, or git:git? default... yes Repo paths access is drwxrws---? default... yes hooks directories in repos are links: ... 1/1 ... ok Running /opt/gitlab/embedded/service/gitlab-shell/bin/check Check GitLab API access: OK Redis available via internal API: OK
Access to /var/opt/gitlab/.ssh/authorized_keys: OK gitlab-shell self-check successful
Checking GitLab Shell ... Finished
Checking Sidekiq ...
Running? ... yes Number of Sidekiq processes ... 1
Checking Sidekiq ... Finished
Reply by email is disabled in config/gitlab.yml Checking LDAP ...
LDAP is disabled in config/gitlab.yml
Checking LDAP ... Finished
Checking GitLab ...
Git configured correctly? ... yes Database config exists? ... yes All migrations up? ... yes Database contains orphaned GroupMembers? ... no GitLab config exists? ... yes GitLab config up to date? ... yes Log directory writable? ... yes Tmp directory writable? ... yes Uploads directory exists? ... yes Uploads directory has correct permissions? ... yes Uploads directory tmp has correct permissions? ... skipped (no tmp uploads folder yet) Init script exists? ... skipped (omnibus-gitlab has no init script) Init script up-to-date? ... skipped (omnibus-gitlab has no init script) Projects have namespace: ... 1/1 ... yes Redis version >= 2.8.0? ... yes Ruby version >= 2.3.3 ? ... Exception: undefined method
run_command' for SystemCheck::App::RubyVersionCheck:Class Did you mean? run_commands Git version >= 2.7.3 ? ... Exception: undefined method
run_command' for SystemCheck::App::GitVersionCheck:Class Did you mean? run_commands Git user has default SSH configuration? ... yes Active users: ... 1Checking GitLab ... Finished
Possible fixes
- Increase offline runner / stuck build thresholds when job time is increased.
- Count 'PATCH' calls as activity for purposes of detecting offline runners.
- Location of cancel: https://gitlab.com/gitlab-org/gitlab-ce/blob/master/app/workers/stuck_ci_jobs_worker.rb#L54