Pipeline gets stuck in a job when a self-hosted runner disconnects

Summary

I'm launching a self-hosted runner using CML, and it picks up a job from the pipeline shown below. If I disconnect the runner while the command sleep 120 is running, the pipeline gets stuck. GitLab does not detect that the runner is no longer available, and even if I start a new runner the job is not retried or rescheduled; it just stays in the running state until it times out.

stages:
  - ml
  - check

train:
  stage: ml
  tags:
    - gpu

  script:
    - echo 'Hi from CML!' >> report.md
    - cml-send-comment report.md

    - sleep 120

check:
  stage: check
  when: on_failure
  needs:
    - train

  script:
    - echo "Did it train?"

Steps to reproduce

  • Launch a self-hosted runner
  • Set up a very simple repo; only the pipeline shown above is needed
  • While the train job is running sleep 120, destroy your self-hosted runner

What is the current bug behavior?

The job stays in the running state until it times out, even though its runner is gone.

What is the expected correct behavior?

The job should stop once its runner is no longer available.

Proposal

It looks like there's an issue in StuckCIWorker that needs to be addressed, per this comment from @fabiopitino.
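Until that is fixed, a pipeline-side mitigation might look like the sketch below. It is not part of the original report and is untested against this bug. It assumes that a shorter job-level `timeout` also shortens how long GitLab leaves the orphaned job in the running state, and that the job is eventually dropped with a failure reason that `retry:when` matches. The `10m` value and the retry count are illustrative only.

```yaml
train:
  stage: ml
  tags:
    - gpu

  # Assumption: a short job-level timeout bounds how long a job whose
  # runner has vanished stays "running". Illustrative value.
  timeout: 10m

  # Assumption: once the job is dropped, GitLab reports one of these
  # failure reasons, so a newly started runner can pick up the retry.
  retry:
    max: 2
    when:
      - runner_system_failure
      - stuck_or_timeout_failure

  script:
    - echo 'Hi from CML!' >> report.md
    - cml-send-comment report.md
    - sleep 120
```

If the server-side stuck-job check ignores the job-level timeout, this only helps with the retry half; the underlying detection problem described above would still need the fix.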

Relevant logs and/or screenshots

Davids-MacBook-Pro:cml-spot-example davidgortega$ docker run --name runnerg --rm \
>     -e RUNNER_IDLE_TIMEOUT=1800 \
>     -e RUNNER_LABELS=gpu \
>     -e RUNNER_REPO=https://gitlab.com/DavidGOrtega/cml-spot-example-gitlab \
>     -e repo_token=$repo_token \
>     dvcorg/cml-gpu-py3-cloud-runner
Starting runner with shell executor
Registering Gitlab runner
{"arch":"amd64","level":"info","msg":"Runtime platform","os":"linux","pid":18,"revision":"6fbc7474","time":"2020-07-17T12:31:04Z","version":"13.1.1"}

{"level":"info","msg":"Starting runner for https://gitlab.com/ with token LGtng_o9 ...","time":"2020-07-17T12:31:04Z"}

{"job":644109193,"level":"info","msg":"Checking for jobs... received","repo_url":"https://gitlab.com/DavidGOrtega/cml-spot-example-gitlab.git","runner":"LGtng_o9","time":"2020-07-17T12:54:05Z"}

{"job":644109193,"level":"info","msg":"executor not supported","project":20009146,"referee":"metrics","runner":"LGtng_o9","time":"2020-07-17T12:54:05Z"}

{"duration":64542061600,"job":644109193,"level":"info","msg":"Job succeeded","project":20009146,"runner":"LGtng_o9","time":"2020-07-17T12:55:10Z"}

{"job":644125203,"level":"info","msg":"Checking for jobs... received","repo_url":"https://gitlab.com/DavidGOrtega/cml-spot-example-gitlab.git","runner":"LGtng_o9","time":"2020-07-17T13:02:17Z"}

{"job":644125203,"level":"info","msg":"executor not supported","project":20009146,"referee":"metrics","runner":"LGtng_o9","time":"2020-07-17T13:02:17Z"}

^CUnregistering runner
Runtime platform                                    arch=amd64 os=linux pid=244 revision=6fbc7474 version=13.1.1
Running in system-mode.                            
                                                   
Runtime platform                                    arch=amd64 os=linux pid=253 revision=6fbc7474 version=13.1.1
Running in system-mode.                            
                                                   
Unregistering runner from GitLab succeeded          runner=LGtng_o9
Shutting down docker machine