Skip to content

StuckCiJobsWorker wrongly detects, cancels 'stuck' builds when per-job timeout is more than an hour

Summary

When builds are allowed to take more than 60 minutes, builders on long builds can be marked as offline when they're actually online; and waiting builds can be cancelled as stuck when they're not stuck.

Steps to reproduce

  1. Get a Gitlab CE instance with the default configuration. I did it like so, on an ec2 m3.medium with ami-7842030f:
sudo apt-get update; curl -sS https://packages.gitlab.com/install/repositories/gitlab/gitlab-ce/script.deb.sh | sudo bash; sudo apt-get install -y gitlab-ce; curl -sSL https://get.docker.com/ | sh; curl -L https://packages.gitlab.com/install/repositories/runner/gitlab-ci-multi-runner/script.deb.sh | sudo bash; sudo apt-get -y install gitlab-ci-multi-runner; sudo apt-get upgrade -y;
sudo gitlab-ctl reconfigure
sudo gitlab-runner register
  1. Make a project with a single, dedicated runner (and no shared runners).
  2. Set the per-job timeout to 90 minutes. Make the builds take 85 minutes. Deselect 'Auto-cancel redundant, pending pipelines' or create two branches so you can start two builds.
  3. Start two builds at 10:50
  4. The second build will be cancelled at 12:00.

Example Project

I attempted to reproduce this on gitlab.com but was unsuccessful; I had to install my own Gitlab CE instance.

My project had a single file created with http://54.170.105.106/root/gitlab-ci-test/new/master?commit_message=Add+.gitlab-ci.yml&file_name=.gitlab-ci.yml

image: ubuntu:16.04
build:
  stage: build
  script: 
    - echo "Hello world"
    - sleep 85m
    - echo "Done sleeping."

What is the current bug behavior?

  1. Two builds are queued at 10:50; one is picked up by the runner, the other waits.
  2. At 11:50, the runner is marked Runner is offline, last contact was about 1 hour ago
  3. At 12:00, the second build is cancelled as it's stuck.
  4. At 12:15, the first build completes and the runner is marked online again.

What is the expected correct behavior?

  1. Two builds are queued at 10:50; one is picked up by the runner, the other waits.
  2. At 12:15, the first build completes and the second build starts.
  3. At 13:40, the second build completes.

Relevant logs and/or screenshots

Jobs 2 and 3 being cancelled, on the hour, while job 1 is running:

Started PATCH "/api/v4/jobs/1/trace" for 127.0.0.1 at 2017-09-22 17:59:57 +0000
StuckCiJobsWorker: Cleaning stuck builds
StuckCiJobsWorker: Dropping stuck build 2 for runner  (status: pending, timeout: 3600)
StuckCiJobsWorker: Dropping stuck build 3 for runner  (status: pending, timeout: 3600)
<snip mailing and metrics>
Started PATCH "/api/v4/jobs/1/trace" for 127.0.0.1 at 2017-09-22 18:00:27 +0000

Gitlab API project jobs:

[
  {
    "id": 3,
    "status": "failed",
    "stage": "build",
    "name": "build",
    "ref": "master",
    "tag": false,
    "coverage": null,
    "created_at": "2017-09-22T16:56:47.658Z",
    "started_at": null,
    "finished_at": "2017-09-22T18:00:09.595Z",
    "user": <snip>
    "commit": <snip>
    "runner": null,
    "pipeline": {
      "id": 3,
      "sha": "19adf1b443ff9fcb9ca6ce514d64ba48aa4c2d89",
      "ref": "master",
      "status": "failed"
    }
  },
  {
    "id": 2,
    "status": "failed",
    "stage": "build",
    "name": "build",
    "ref": "master",
    "tag": false,
    "coverage": null,
    "created_at": "2017-09-22T16:56:36.026Z",
    "started_at": null,
    "finished_at": "2017-09-22T18:00:09.506Z",
    "user": <snip>
    "commit": <snip>
    "runner": null,
    "pipeline": {
      "id": 2,
      "sha": "19adf1b443ff9fcb9ca6ce514d64ba48aa4c2d89",
      "ref": "master",
      "status": "failed"
    }
  },
  {
    "id": 1,
    "status": "running",
    "stage": "build",
    "name": "build",
    "ref": "master",
    "tag": false,
    "coverage": null,
    "created_at": "2017-09-22T16:56:23.660Z",
    "started_at": "2017-09-22T16:56:24.923Z",
    "finished_at": null,
    "user": <snip>
    "commit": <snip>
    "runner": {
      "id": 1,
      "description": "ip-10-98-181-67",
      "active": true,
      "is_shared": false,
      "name": "gitlab-ci-multi-runner"
    },
    "pipeline": {
      "id": 1,
      "sha": "19adf1b443ff9fcb9ca6ce514d64ba48aa4c2d89",
      "ref": "master",
      "status": "running"
    }
  }
]

Output of checks

(If you are reporting a bug on GitLab.com, write: This bug happens on GitLab.com)

Results of GitLab environment info

Expand for output related to GitLab environment info
ubuntu@ip-10-98-181-67:/var/log/gitlab$ sudo gitlab-rake gitlab:env:info

System information System: Ubuntu 14.04 Current User: git Using RVM: no Ruby Version: 2.3.5p376 Gem Version: 2.6.13 Bundler Version:1.13.7 Rake Version: 12.0.0 Redis Version: 3.2.5 Git Version: 2.13.5 Sidekiq Version:5.0.4 Go Version: unknown

GitLab information Version: 10.0.0 Revision: bbf5c73 Directory: /opt/gitlab/embedded/service/gitlab-rails DB Adapter: postgresql URL: http://ec2-54-170-105-106.eu-west-1.compute.amazonaws.com HTTP Clone URL: http://ec2-54-170-105-106.eu-west-1.compute.amazonaws.com/some-group/some-project.git SSH Clone URL: git@ec2-54-170-105-106.eu-west-1.compute.amazonaws.com:some-group/some-project.git Using LDAP: no Using Omniauth: no

GitLab Shell Version: 5.9.0 Repository storage paths:

  • default: /var/opt/gitlab/git-data/repositories Hooks: /opt/gitlab/embedded/service/gitlab-shell/hooks Git: /opt/gitlab/embedded/bin/git

Results of GitLab application Check

Expand for output related to the GitLab application check
ubuntu@ip-10-98-181-67:/var/log/gitlab$ sudo gitlab-rake gitlab:check SANITIZE=true
Checking GitLab Shell ...

GitLab Shell version >= 5.9.0 ? ... OK (5.9.0) Repo base directory exists? default... yes Repo storage directories are symlinks? default... no Repo paths owned by git:root, or git:git? default... yes Repo paths access is drwxrws---? default... yes hooks directories in repos are links: ... 1/1 ... ok Running /opt/gitlab/embedded/service/gitlab-shell/bin/check Check GitLab API access: OK Redis available via internal API: OK

Access to /var/opt/gitlab/.ssh/authorized_keys: OK gitlab-shell self-check successful

Checking GitLab Shell ... Finished

Checking Sidekiq ...

Running? ... yes Number of Sidekiq processes ... 1

Checking Sidekiq ... Finished

Reply by email is disabled in config/gitlab.yml Checking LDAP ...

LDAP is disabled in config/gitlab.yml

Checking LDAP ... Finished

Checking GitLab ...

Git configured correctly? ... yes Database config exists? ... yes All migrations up? ... yes Database contains orphaned GroupMembers? ... no GitLab config exists? ... yes GitLab config up to date? ... yes Log directory writable? ... yes Tmp directory writable? ... yes Uploads directory exists? ... yes Uploads directory has correct permissions? ... yes Uploads directory tmp has correct permissions? ... skipped (no tmp uploads folder yet) Init script exists? ... skipped (omnibus-gitlab has no init script) Init script up-to-date? ... skipped (omnibus-gitlab has no init script) Projects have namespace: ... 1/1 ... yes Redis version >= 2.8.0? ... yes Ruby version >= 2.3.3 ? ... Exception: undefined method run_command' for SystemCheck::App::RubyVersionCheck:Class Did you mean? run_commands Git version >= 2.7.3 ? ... Exception: undefined method run_command' for SystemCheck::App::GitVersionCheck:Class Did you mean? run_commands Git user has default SSH configuration? ... yes Active users: ... 1

Checking GitLab ... Finished

Possible fixes

Gitlab is great software thanks for reading my bug report!