Pipeline gets stuck in a job when a self-hosted runner disconnects
Summary
I'm launching a self-hosted runner using CML, and it picks up a job from the pipeline shown below. If I disconnect the runner during the sleep 120 command, the pipeline gets stuck. GitLab does not detect that the runner is no longer available, and even if I start a new runner the job is not retried; it just stays in the running state until the timeout.
stages:
  - ml
  - check

train:
  stage: ml
  tags:
    - gpu
  script:
    - echo 'Hi from CML!' >> report.md
    - cml-send-comment report.md
    - sleep 120

check:
  stage: check
  when: on_failure
  needs:
    - train
  script:
    - echo "Did it train?"
Steps to reproduce
- Launch a self-hosted runner.
- Set up a very simple repo; only the pipeline shown above is needed.
- While the train job is in its sleep step, destroy the self-hosted runner.
What is the current bug behavior?
The job stays in the running state until the timeout, even though its runner is gone.
What is the expected correct behavior?
The job should stop once its runner is no longer available.
Proposal
It looks like there's an issue in StuckCIWorker that needs to be addressed, per this comment from @fabiopitino.
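Until that is fixed, a possible mitigation (an untested sketch, not a confirmed fix) is to shorten the window in which the job can hang and ask GitLab to retry it. The fragment below uses the job-level timeout and retry:when keywords from the GitLab CI YAML reference; the 10 minute value and the choice of failure reasons are assumptions for illustration, and whether either failure reason is actually reported when a runner disappears mid-job is exactly what this issue puts in doubt.

```yaml
train:
  stage: ml
  tags:
    - gpu
  # Assumed value: fail the job well before the project-level timeout.
  timeout: 10 minutes
  retry:
    max: 2
    when:
      # Assumption: one of these reasons is reported once the runner is gone.
      - runner_system_failure
      - stuck_or_timeout_failure
  script:
    - echo 'Hi from CML!' >> report.md
    - cml-send-comment report.md
    - sleep 120
```

With this in place the check job (when: on_failure) would at least run sooner if the train job is eventually marked as failed.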
Relevant logs and/or screenshots
Davids-MacBook-Pro:cml-spot-example davidgortega$ docker run --name runnerg --rm \
> -e RUNNER_IDLE_TIMEOUT=1800 \
> -e RUNNER_LABELS=gpu \
> -e RUNNER_REPO=https://gitlab.com/DavidGOrtega/cml-spot-example-gitlab \
> -e repo_token=$repo_token \
> dvcorg/cml-gpu-py3-cloud-runner
Starting runner with shell executor
Registering Gitlab runner
{"arch":"amd64","level":"info","msg":"Runtime platform","os":"linux","pid":18,"revision":"6fbc7474","time":"2020-07-17T12:31:04Z","version":"13.1.1"}
{"level":"info","msg":"Starting runner for https://gitlab.com/ with token LGtng_o9 ...","time":"2020-07-17T12:31:04Z"}
{"job":644109193,"level":"info","msg":"Checking for jobs... received","repo_url":"https://gitlab.com/DavidGOrtega/cml-spot-example-gitlab.git","runner":"LGtng_o9","time":"2020-07-17T12:54:05Z"}
{"job":644109193,"level":"info","msg":"executor not supported","project":20009146,"referee":"metrics","runner":"LGtng_o9","time":"2020-07-17T12:54:05Z"}
{"duration":64542061600,"job":644109193,"level":"info","msg":"Job succeeded","project":20009146,"runner":"LGtng_o9","time":"2020-07-17T12:55:10Z"}
{"job":644125203,"level":"info","msg":"Checking for jobs... received","repo_url":"https://gitlab.com/DavidGOrtega/cml-spot-example-gitlab.git","runner":"LGtng_o9","time":"2020-07-17T13:02:17Z"}
{"job":644125203,"level":"info","msg":"executor not supported","project":20009146,"referee":"metrics","runner":"LGtng_o9","time":"2020-07-17T13:02:17Z"}
^CUnregistering runner
Runtime platform arch=amd64 os=linux pid=244 revision=6fbc7474 version=13.1.1
Running in system-mode.
Runtime platform arch=amd64 os=linux pid=253 revision=6fbc7474 version=13.1.1
Running in system-mode.
Unregistering runner from GitLab succeeded runner=LGtng_o9
Shutting down docker machine