k8s runner sometimes has hanging jobs that don't finish until timeout

Summary

The k8s runner sometimes (very rarely) hangs; the job page only shows part of the log.

While a job hangs, I exec into the runner pod and can see that the logs there are complete and the corresponding stage has finished. The runner log repeatedly shows:

time="2021-11-23T07:08:46Z" level=debug msg="Submitting job to coordinator... ok" code=200 job=1984266 job-status= runner=Wa4GnVZx update-interval=0s

Steps to reproduce

I don't know how to reproduce this; retrying the job always helps.

Actual behavior

The job hangs until it hits the timeout.

Expected behavior

The job does not hang.

Relevant logs and/or screenshots

The job page never shows any new log output.

Runner pod logs:

time="2021-11-23T07:03:59Z" level=debug msg="Container \"helper\" exited with error: <nil>" job=1984266 project=4 runner=Wa4GnVZx
time="2021-11-23T07:03:59Z" level=debug msg="Executing build stage" build_stage=restore_cache job=1984266 project=4 runner=Wa4GnVZx
time="2021-11-23T07:03:59Z" level=debug msg="Skipping stage (nothing to do)" build_stage=restore_cache job=1984266 project=4 runner=Wa4GnVZx
time="2021-11-23T07:03:59Z" level=debug msg="Executing build stage" build_stage=download_artifacts job=1984266 project=4 runner=Wa4GnVZx
time="2021-11-23T07:03:59Z" level=debug msg="Skipping stage (nothing to do)" build_stage=download_artifacts job=1984266 project=4 runner=Wa4GnVZx
time="2021-11-23T07:03:59Z" level=debug msg="Executing build stage" build_stage=step_script job=1984266 project=4 runner=Wa4GnVZx
time="2021-11-23T07:03:59Z" level=debug msg="\x1b[36;1mExecuting \"step_script\" stage of the job script\x1b[0;m" job=1984266 project=4 runner=Wa4GnVZx
time="2021-11-23T07:03:59Z" level=debug msg="Starting Kubernetes command with attach..." job=1984266 project=4 runner=Wa4GnVZx
time="2021-11-23T07:03:59Z" level=debug msg="Starting in container \"build\" the command [-----MY COMMAND-----] job=1984266 project=4 runner=Wa4GnVZx
time="2021-11-23T07:04:44Z" level=debug msg="Appending trace to coordinator... ok" code=202 job=1984266 job-log=0-17125 job-status=running runner=Wa4GnVZx sent-log=3548-17124 status="202 Accepted" update-interval=1m0s
time="2021-11-23T07:05:45Z" level=debug msg="Appending trace to coordinator... ok" code=202 job=1984266 job-log=0-23113 job-status=running runner=Wa4GnVZx sent-log=17125-23112 status="202 Accepted" update-interval=1m0s
time="2021-11-23T07:06:45Z" level=debug msg="Appending trace to coordinator... ok" code=202 job=1984266 job-log=0-25117 job-status=running runner=Wa4GnVZx sent-log=23113-25116 status="202 Accepted" update-interval=1m0s
time="2021-11-23T07:07:45Z" level=debug msg="Submitting job to coordinator... ok" code=200 job=1984266 job-status= runner=Wa4GnVZx update-interval=0s
time="2021-11-23T07:08:46Z" level=debug msg="Submitting job to coordinator... ok" code=200 job=1984266 job-status= runner=Wa4GnVZx update-interval=0s
time="2021-11-23T07:09:46Z" level=debug msg="Submitting job to coordinator... ok" code=200 job=1984266 job-status= runner=Wa4GnVZx update-interval=0s
time="2021-11-23T07:10:46Z" level=debug msg="Submitting job to coordinator... ok" code=200 job=1984266 job-status= runner=Wa4GnVZx update-interval=0s
time="2021-11-23T07:11:46Z" level=debug msg="Submitting job to coordinator... ok" code=200 job=1984266 job-status= runner=Wa4GnVZx update-interval=0s
time="2021-11-23T07:12:46Z" level=debug msg="Submitting job to coordinator... ok" code=200 job=1984266 job-status= runner=Wa4GnVZx update-interval=0s

Worker pod logs, from 'cat logs-4-1984266/output.log':

Running on runner-wa4gnvzx-project-4-concurrent-4xmbc5 via xxxx-755c55697c-2lk9n...

{"command_exit_code": 0, "script": "/scripts-4-1984266/prepare_script"}
$ echo ==================; # collapsed multi-line command

MY LOGS....
> Shared memory destroyed

{"command_exit_code": 0, "script": "/scripts-4-1984266/step_script"}

Environment description

FF_USE_LEGACY_KUBERNETES_EXECUTION_STRATEGY is false, so the attach-based execution strategy is in use (see the "Starting Kubernetes command with attach..." line in the runner log above).
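
For comparison, the legacy exec strategy could be re-enabled by passing the feature flag to the runner deployment. This is only a rough sketch, assuming the runner was installed from the gitlab/gitlab-runner Helm chart and that the chart's envVars value is used to set the flag for the runner manager; the release name and namespace are placeholders for my setup:

# Hypothetical example: switch back to the legacy exec strategy by setting
# the feature flag as an environment variable on the runner pod.
# "gitlab-runner" release/namespace and the envVars key are assumptions.
helm upgrade gitlab-runner gitlab/gitlab-runner \
  --namespace gitlab-runner \
  --reuse-values \
  --set "envVars[0].name=FF_USE_LEGACY_KUBERNETES_EXECUTION_STRATEGY" \
  --set-string "envVars[0].value=true"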

Used GitLab Runner version

Helm chart versions 0.28.0 and 0.32.0

Corresponding runner versions: 13.11.0 and 14.2.0

Possible fixes

I checked the code, but had no luck finding which lost signal could cause the job to hang.
