Unexpected delay before a job effectively completes
Hi!
I just installed a fresh instance of Gitlab and Gitlab Runner inside a Kubernetes cluster using Helm charts.
Everything seems to be working fine, except that when a job runs, the Gitlab runner seems to take time to effectively complete the job.
As an example, I made a little job that just lists nodes on a Kubernetes cluster like this:
stages:
- test
test:
stage: test
image: "bitnami/kubectl:1.19.4"
script:
- export KUBECONFIG=$KUBE_DEV_KUBECONFIG
- kubectl get nodes
Pushing this triggers a pipeline and I can see the list of nodes in my cluster in the jobs' logs. The job displays "Job succeeded" at the end. However, the job is still in the "running" state and effectively completes after a couple of minutes (last time I timed it was 5m19s between the "Job succeeded" sentence being displayed and the job passes as "passed").
Switching the gitlab-runner to debug, I can see the following traces for the job:
Checking for jobs... received job=57 repo_url=https://#myrepo#/api.git runner=To-feDnn
Requeued the runner builds=1 runner=To-feDnn
Running with gitlab-runner 13.5.0 (ece86343) job=57 project=4 runner=To-feDnn
on gitlab-runner-gitlab-runner-8494df687c-jtl79 To-feDnn job=57 project=4 runner=To-feDnn
Dialing: tcp gitlab-webservice:8080 ...
Preparing the "kubernetes" executor job=57 project=4 runner=To-feDnn
Shell configuration: environment: []
dockercommand:
- sh
- -c
- "if [ -x /usr/local/bin/bash ]; then\n\texec /usr/local/bin/bash \nelif [ -x /usr/bin/bash
]; then\n\texec /usr/bin/bash \nelif [ -x /bin/bash ]; then\n\texec /bin/bash \nelif
[ -x /usr/local/bin/sh ]; then\n\texec /usr/local/bin/sh \nelif [ -x /usr/bin/sh
]; then\n\texec /usr/bin/sh \nelif [ -x /bin/sh ]; then\n\texec /bin/sh \nelif [
-x /busybox/sh ]; then\n\texec /busybox/sh \nelse\n\techo shell not found\n\texit
1\nfi\n\n"
command: bash
arguments: []
passfile: false
extension: ""
job=57 project=4 runner=To-feDnn
Regex allowing overrides for Namespace is empty, disabling override. job=57 project=4 runner=To-feDnn
Regex allowing overrides for ServiceAccount is empty, disabling override. job=57 project=4 runner=To-feDnn
Regex allowing overrides for BearerToken is empty, disabling override. job=57 project=4 runner=To-feDnn
Regex allowing overrides for PodAnnotations is empty, disabling override. job=57 project=4 runner=To-feDnn
setting allowing overrides for CPURequest is empty, disabling override. job=57 project=4 runner=To-feDnn
setting allowing overrides for MemoryRequest is empty, disabling override. job=57 project=4 runner=To-feDnn
setting allowing overrides for EphemeralStorageRequest is empty, disabling override. job=57 project=4 runner=To-feDnn
setting allowing overrides for CPULimit is empty, disabling override. job=57 project=4 runner=To-feDnn
setting allowing overrides for MemoryLimit is empty, disabling override. job=57 project=4 runner=To-feDnn
setting allowing overrides for EphemeralStorageLimit is empty, disabling override. job=57 project=4 runner=To-feDnn
setting allowing overrides for ServiceCPURequest is empty, disabling override. job=57 project=4 runner=To-feDnn
setting allowing overrides for ServiceMemoryRequest is empty, disabling override. job=57 project=4 runner=To-feDnn
setting allowing overrides for ServiceEphemeralStorageRequest is empty, disabling override. job=57 project=4 runner=To-feDnn
setting allowing overrides for ServiceCPULimit is empty, disabling override. job=57 project=4 runner=To-feDnn
setting allowing overrides for ServiceMemoryLimit is empty, disabling override. job=57 project=4 runner=To-feDnn
setting allowing overrides for ServiceEphemeralStorageLimit is empty, disabling override. job=57 project=4 runner=To-feDnn
setting allowing overrides for HelperCPURequest is empty, disabling override. job=57 project=4 runner=To-feDnn
setting allowing overrides for HelperMemoryRequest is empty, disabling override. job=57 project=4 runner=To-feDnn
setting allowing overrides for HelperEphemeralStorageRequest is empty, disabling override. job=57 project=4 runner=To-feDnn
setting allowing overrides for HelperCPULimit is empty, disabling override. job=57 project=4 runner=To-feDnn
setting allowing overrides for HelperMemoryLimit is empty, disabling override. job=57 project=4 runner=To-feDnn
setting allowing overrides for HelperEphemeralStorageLimit is empty, disabling override. job=57 project=4 runner=To-feDnn
Using Kubernetes namespace: gitlab job=57 project=4 runner=To-feDnn
Using Kubernetes executor with image bitnami/kubectl:1.19.4 ... job=57 project=4 runner=To-feDnn
Waiting for signals... job=57 project=4 runner=To-feDnn
No referees configured job=57 project=4 runner=To-feDnn
Executing build stage build_stage=prepare_script job=57 project=4 runner=To-feDnn
Preparing environment job=57 project=4 runner=To-feDnn
Starting Kubernetes command... job=57 project=4 runner=To-feDnn
Setting up secrets job=57 project=4 runner=To-feDnn
Setting up build pod job=57 project=4 runner=To-feDnn
Creating build pod job=57 project=4 runner=To-feDnn
Checking for jobs... nothing runner=To-feDnn
Creating pod proxy services job=57 project=4 runner=To-feDnn
Starting in container "helper" the command ["gitlab-runner-build"] with script: #!/usr/bin/env bash
set -eo pipefail
set +o noclobber
: | eval $'echo "Running on $(hostname) via gitlab-runner-gitlab-runner-8494df687c-jtl79..."\n'
exit 0
job=57 project=4 runner=To-feDnn
Dialing: tcp gitlab-webservice:8080 ...
Appending trace to coordinator... ok code=202 job=57 job-log=0-573 job-status=running runner=To-feDnn sent-log=0-572 status=202 Accepted update-interval=3s
Dialing: tcp gitlab-webservice:8080 ...
Appending trace to coordinator... ok code=202 job=57 job-log=0-813 job-status=running runner=To-feDnn sent-log=573-812 status=202 Accepted update-interval=3s
Container "helper" exited with error: <nil> job=57 project=4 runner=To-feDnn
Executing build stage build_stage=get_sources job=57 project=4 runner=To-feDnn
Getting source from Git repository job=57 project=4 runner=To-feDnn
Starting Kubernetes command... job=57 project=4 runner=To-feDnn
Starting in container "helper" the command ["gitlab-runner-build"] with script: #!/usr/bin/env bash
set -eo pipefail
set +o noclobber
#data export#
exit 0
job=57 project=4 runner=To-feDnn
Feeding runners to channel builds=1
Dialing: tcp gitlab-webservice:8080 ...
Checking for jobs... nothing runner=To-feDnn
Container "helper" exited with error: <nil> job=57 project=4 runner=To-feDnn
Executing build stage build_stage=restore_cache job=57 project=4 runner=To-feDnn
Skipping stage (nothing to do) build_stage=restore_cache job=57 project=4 runner=To-feDnn
Executing build stage build_stage=download_artifacts job=57 project=4 runner=To-feDnn
Skipping stage (nothing to do) build_stage=download_artifacts job=57 project=4 runner=To-feDnn
Executing build stage build_stage=step_script job=57 project=4 runner=To-feDnn
Executing "step_script" stage of the job script job=57 project=4 runner=To-feDnn
Starting Kubernetes command... job=57 project=4 runner=To-feDnn
Starting in container "build" the command ["sh" "-c" "if [ -x /usr/local/bin/bash ]; then\n\texec /usr/local/bin/bash \nelif [ -x /usr/bin/bash ]; then\n\texec /usr/bin/bash \nelif [ -x /bin/bash ]; then\n\texec /bin/bash \nelif [ -x /usr/local/bin/sh ]; then\n\texec /usr/local/bin/sh \nelif [ -x /usr/bin/sh ]; then\n\texec /usr/bin/sh \nelif [ -x /bin/sh ]; then\n\texec /bin/sh \nelif [ -x /busybox/sh ]; then\n\texec /busybox/sh \nelse\n\techo shell not found\n\texit 1\nfi\n\n"] with script: #!/usr/bin/env bash
set -eo pipefail
set +o noclobber
#data export#
exit 0
job=57 project=4 runner=To-feDnn
Dialing: tcp gitlab-webservice:8080 ...
Appending trace to coordinator... ok code=202 job=57 job-log=0-1548 job-status=running runner=To-feDnn sent-log=813-1547 status=202 Accepted update-interval=3s
Container "build" exited with error: <nil> job=57 project=4 runner=To-feDnn
Executing build stage build_stage=after_script job=57 project=4 runner=To-feDnn
Skipping stage (nothing to do) build_stage=after_script job=57 project=4 runner=To-feDnn
Executing build stage build_stage=archive_cache job=57 project=4 runner=To-feDnn
Skipping stage (nothing to do) build_stage=archive_cache job=57 project=4 runner=To-feDnn
Executing build stage build_stage=upload_artifacts_on_success job=57 project=4 runner=To-feDnn
Skipping stage (nothing to do) build_stage=upload_artifacts_on_success job=57 project=4 runner=To-feDnn
Skipping referees execution job=57 project=4 runner=To-feDnn
Executing build stage build_stage=cleanup_file_variables job=57 project=4 runner=To-feDnn
Cleaning up file based variables job=57 project=4 runner=To-feDnn
Starting Kubernetes command... job=57 project=4 runner=To-feDnn
Starting in container "helper" the command ["gitlab-runner-build"] with script: #!/usr/bin/env bash
set -eo pipefail
set +o noclobber
: | eval $'$\'rm\' "-f" "/builds/#myrepo#.tmp/KUBE_DEV_KUBECONFIG"\n$\'rm\' "-f" "/builds/#myrepo#.tmp/KUBE_DEV_KUBECONFIG"\n'
exit 0
job=57 project=4 runner=To-feDnn
Container "helper" exited with error: <nil> job=57 project=4 runner=To-feDnn
Job succeeded duration=11.327113727s job=57 project=4 runner=To-feDnn
Dialing: tcp gitlab-webservice:8080 ...
Appending trace to coordinator... ok code=202 job=57 job-log=0-1921 job-status=running runner=To-feDnn sent-log=1548-1920 status=202 Accepted update-interval=3s
Dialing: tcp gitlab-webservice:8080 ...
Submitting job to coordinator... accepted, but not yet completed code=202 job=57 job-status= runner=To-feDnn update-interval=1s
Dialing: tcp gitlab-webservice:8080 ...
Submitting job to coordinator... accepted, but not yet completed code=202 job=57 job-status= runner=To-feDnn update-interval=1s
Dialing: tcp gitlab-webservice:8080 ...
Submitting job to coordinator... accepted, but not yet completed code=202 job=57 job-status= runner=To-feDnn update-interval=1s
Dialing: tcp gitlab-webservice:8080 ...
Feeding runners to channel builds=1
Dialing: tcp gitlab-webservice:8080 ...
Checking for jobs... nothing runner=To-feDnn
Submitting job to coordinator... accepted, but not yet completed code=202 job=57 job-status= runner=To-feDnn update-interval=4s
Dialing: tcp gitlab-webservice:8080 ...
Submitting job to coordinator... accepted, but not yet completed code=202 job=57 job-status= runner=To-feDnn update-interval=4s
Feeding runners to channel builds=1
Dialing: tcp gitlab-webservice:8080 ...
Checking for jobs... nothing runner=To-feDnn
Dialing: tcp gitlab-webservice:8080 ...
Submitting job to coordinator... accepted, but not yet completed code=202 job=57 job-status= runner=To-feDnn update-interval=8s
Dialing: tcp gitlab-webservice:8080 ...
Submitting job to coordinator... accepted, but not yet completed code=202 job=57 job-status= runner=To-feDnn update-interval=8s
Feeding runners to channel builds=1
Dialing: tcp gitlab-webservice:8080 ...
Checking for jobs... nothing runner=To-feDnn
Dialing: tcp gitlab-webservice:8080 ...
Submitting job to coordinator... accepted, but not yet completed code=202 job=57 job-status= runner=To-feDnn update-interval=16s
Feeding runners to channel builds=1
Dialing: tcp gitlab-webservice:8080 ...
Checking for jobs... nothing runner=To-feDnn
Feeding runners to channel builds=1
Dialing: tcp gitlab-webservice:8080 ...
Checking for jobs... nothing runner=To-feDnn
Dialing: tcp gitlab-webservice:8080 ...
Submitting job to coordinator... accepted, but not yet completed code=202 job=57 job-status= runner=To-feDnn update-interval=16s
Feeding runners to channel builds=1
Dialing: tcp gitlab-webservice:8080 ...
Checking for jobs... nothing runner=To-feDnn
Dialing: tcp gitlab-webservice:8080 ...
Submitting job to coordinator... accepted, but not yet completed code=202 job=57 job-status= runner=To-feDnn update-interval=32s
Feeding runners to channel builds=1
Dialing: tcp gitlab-webservice:8080 ...
Checking for jobs... nothing runner=To-feDnn
Feeding runners to channel builds=1
Dialing: tcp gitlab-webservice:8080 ...
Checking for jobs... nothing runner=To-feDnn
Feeding runners to channel builds=1
Dialing: tcp gitlab-webservice:8080 ...
Checking for jobs... nothing runner=To-feDnn
Feeding runners to channel builds=1
Dialing: tcp gitlab-webservice:8080 ...
Checking for jobs... nothing runner=To-feDnn
Dialing: tcp gitlab-webservice:8080 ...
Submitting job to coordinator... accepted, but not yet completed code=202 job=57 job-status= runner=To-feDnn update-interval=32s
Feeding runners to channel builds=1
Dialing: tcp gitlab-webservice:8080 ...
Checking for jobs... nothing runner=To-feDnn
Feeding runners to channel builds=1
Dialing: tcp gitlab-webservice:8080 ...
Checking for jobs... nothing runner=To-feDnn
Feeding runners to channel builds=1
Dialing: tcp gitlab-webservice:8080 ...
Checking for jobs... nothing runner=To-feDnn
Dialing: tcp gitlab-webservice:8080 ...
Submitting job to coordinator... accepted, but not yet completed code=202 job=57 job-status= runner=To-feDnn update-interval=1m4s
Feeding runners to channel builds=1
Dialing: tcp gitlab-webservice:8080 ...
Checking for jobs... nothing runner=To-feDnn
Feeding runners to channel builds=1
Dialing: tcp gitlab-webservice:8080 ...
Checking for jobs... nothing runner=To-feDnn
Feeding runners to channel builds=1
Dialing: tcp gitlab-webservice:8080 ...
Checking for jobs... nothing runner=To-feDnn
Feeding runners to channel builds=1
Dialing: tcp gitlab-webservice:8080 ...
Checking for jobs... nothing runner=To-feDnn
Feeding runners to channel builds=1
Dialing: tcp gitlab-webservice:8080 ...
Checking for jobs... nothing runner=To-feDnn
Feeding runners to channel builds=1
Dialing: tcp gitlab-webservice:8080 ...
Checking for jobs... nothing runner=To-feDnn
Dialing: tcp gitlab-webservice:8080 ...
Submitting job to coordinator... accepted, but not yet completed code=202 job=57 job-status= runner=To-feDnn update-interval=1m4s
Feeding runners to channel builds=1
Dialing: tcp gitlab-webservice:8080 ...
Checking for jobs... nothing runner=To-feDnn
Feeding runners to channel builds=1
Dialing: tcp gitlab-webservice:8080 ...
Checking for jobs... nothing runner=To-feDnn
Feeding runners to channel builds=1
Dialing: tcp gitlab-webservice:8080 ...
Checking for jobs... nothing runner=To-feDnn
Feeding runners to channel builds=1
Dialing: tcp gitlab-webservice:8080 ...
Checking for jobs... nothing runner=To-feDnn
Feeding runners to channel builds=1
Dialing: tcp gitlab-webservice:8080 ...
Checking for jobs... nothing runner=To-feDnn
Feeding runners to channel builds=1
Dialing: tcp gitlab-webservice:8080 ...
Checking for jobs... nothing runner=To-feDnn
Feeding runners to channel builds=1
Dialing: tcp gitlab-webservice:8080 ...
Checking for jobs... nothing runner=To-feDnn
Dialing: tcp gitlab-webservice:8080 ...
Submitting job to coordinator... accepted, but not yet completed code=202 job=57 job-status= runner=To-feDnn update-interval=1m4s
Feeding runners to channel builds=1
Dialing: tcp gitlab-webservice:8080 ...
Checking for jobs... nothing runner=To-feDnn
Feeding runners to channel builds=1
Dialing: tcp gitlab-webservice:8080 ...
Checking for jobs... nothing runner=To-feDnn
Feeding runners to channel builds=1
Dialing: tcp gitlab-webservice:8080 ...
Checking for jobs... nothing runner=To-feDnn
Feeding runners to channel builds=1
Dialing: tcp gitlab-webservice:8080 ...
Checking for jobs... nothing runner=To-feDnn
Feeding runners to channel builds=1
Dialing: tcp gitlab-webservice:8080 ...
Checking for jobs... nothing runner=To-feDnn
Feeding runners to channel builds=1
Dialing: tcp gitlab-webservice:8080 ...
Checking for jobs... nothing runner=To-feDnn
Dialing: tcp gitlab-webservice:8080 ...
Submitting job to coordinator... ok code=200 job=57 job-status= runner=To-feDnn update-interval=0s
Feeding runners to channel builds=0
When the runner reaches the code=200 job=57 job-status= runner=To-feDnn update-interval=0s line, the job finally ends on the UI.
I can't figure out where the Submitting job to coordinator... accepted, but not yet completed happens (I see nothing in the Gitlab logs, but maybe I don't know where to look for), nor what it means. It seems that the "coordinator" is stuck on something for 5 minutes once the job is complete.
Here are some information about my setup:
- I used Gitlab Helm Chart 4.5.3 to deploy Gitlab 13.5.3
- I used Gitlab-Runner Helm Chart 0.22.0 to deploy a single gitlab-runner 13.5.0 (I tested with 13.6.0 -> same result)
The job described above is only an example, all my jobs with more or less complexity from all repositories are behaving the same way.
Any idea about what's happening? I can provide more information if requested.