Kubernetes runner sometimes has hanging jobs that do not finish until timeout
Summary
The Kubernetes runner occasionally (in very few jobs) hangs, and the job page only shows part of the log.
While it hangs, I exec into the runner pod and can see that the logs are complete and the corresponding stage has finished. The runner log repeatedly shows:
time="2021-11-23T07:08:46Z" level=debug msg="Submitting job to coordinator... ok" code=200 job=1984266 job-status= runner=Wa4GnVZx update-interval=0s
Steps to reproduce
I don't know how to reproduce it reliably; retrying the job always helps.
Actual behavior
The job hangs until the timeout is reached.
Expected behavior
The job does not hang.
Relevant logs and/or screenshots
The job page never shows any new log output.
Runner pod log:
time="2021-11-23T07:03:59Z" level=debug msg="Container \"helper\" exited with error: <nil>" job=1984266 project=4 runner=Wa4GnVZx
time="2021-11-23T07:03:59Z" level=debug msg="Executing build stage" build_stage=restore_cache job=1984266 project=4 runner=Wa4GnVZx
time="2021-11-23T07:03:59Z" level=debug msg="Skipping stage (nothing to do)" build_stage=restore_cache job=1984266 project=4 runner=Wa4GnVZx
time="2021-11-23T07:03:59Z" level=debug msg="Executing build stage" build_stage=download_artifacts job=1984266 project=4 runner=Wa4GnVZx
time="2021-11-23T07:03:59Z" level=debug msg="Skipping stage (nothing to do)" build_stage=download_artifacts job=1984266 project=4 runner=Wa4GnVZx
time="2021-11-23T07:03:59Z" level=debug msg="Executing build stage" build_stage=step_script job=1984266 project=4 runner=Wa4GnVZx
time="2021-11-23T07:03:59Z" level=debug msg="\x1b[36;1mExecuting \"step_script\" stage of the job script\x1b[0;m" job=1984266 project=4 runner=Wa4GnVZx
time="2021-11-23T07:03:59Z" level=debug msg="Starting Kubernetes command with attach..." job=1984266 project=4 runner=Wa4GnVZx
time="2021-11-23T07:03:59Z" level=debug msg="Starting in container \"build\" the command [-----MY COMMAND-----] job=1984266 project=4 runner=Wa4GnVZx
time="2021-11-23T07:04:44Z" level=debug msg="Appending trace to coordinator... ok" code=202 job=1984266 job-log=0-17125 job-status=running runner=Wa4GnVZx sent-log=3548-17124 status="202 Accepted" update-interval=1m0s
time="2021-11-23T07:05:45Z" level=debug msg="Appending trace to coordinator... ok" code=202 job=1984266 job-log=0-23113 job-status=running runner=Wa4GnVZx sent-log=17125-23112 status="202 Accepted" update-interval=1m0s
time="2021-11-23T07:06:45Z" level=debug msg="Appending trace to coordinator... ok" code=202 job=1984266 job-log=0-25117 job-status=running runner=Wa4GnVZx sent-log=23113-25116 status="202 Accepted" update-interval=1m0s
time="2021-11-23T07:07:45Z" level=debug msg="Submitting job to coordinator... ok" code=200 job=1984266 job-status= runner=Wa4GnVZx update-interval=0s
time="2021-11-23T07:08:46Z" level=debug msg="Submitting job to coordinator... ok" code=200 job=1984266 job-status= runner=Wa4GnVZx update-interval=0s
time="2021-11-23T07:09:46Z" level=debug msg="Submitting job to coordinator... ok" code=200 job=1984266 job-status= runner=Wa4GnVZx update-interval=0s
time="2021-11-23T07:10:46Z" level=debug msg="Submitting job to coordinator... ok" code=200 job=1984266 job-status= runner=Wa4GnVZx update-interval=0s
time="2021-11-23T07:11:46Z" level=debug msg="Submitting job to coordinator... ok" code=200 job=1984266 job-status= runner=Wa4GnVZx update-interval=0s
time="2021-11-23T07:12:46Z" level=debug msg="Submitting job to coordinator... ok" code=200 job=1984266 job-status= runner=Wa4GnVZx update-interval=0s
Worker pod log, obtained with 'cat logs-4-1984266/output.log':
Running on runner-wa4gnvzx-project-4-concurrent-4xmbc5 via xxxx-755c55697c-2lk9n...
{"command_exit_code": 0, "script": "/scripts-4-1984266/prepare_script"}
$ echo ==================; # collapsed multi-line command
MY LOGS....
> Shared memory destroyed
{"command_exit_code": 0, "script": "/scripts-4-1984266/step_script"}
Environment description
FF_USE_LEGACY_KUBERNETES_EXECUTION_STRATEGY is false (i.e. the attach-based execution strategy is used, as seen in the runner log above).
Used GitLab Runner version
Helm chart versions 0.28.0 and 0.32.0
Corresponding GitLab Runner versions 13.11.0 and 14.2.0
Possible fixes
I checked the runner code, but had no luck finding which lost signal or notification could cause the job to hang; a sketch of the pattern I suspect is below.
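To make the suspected failure mode concrete, here is a minimal sketch (my own simplification, not the actual runner code) of the wait pattern I have in mind: the executor blocks until the log processor delivers the stage's exit status, so if that notification is ever lost, the only way out is the overall job timeout.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// waitForStage blocks until the log processor reports the stage's exit code
// or the job timeout expires. If the status notification is lost, nothing
// ever arrives on exitStatus and the job "hangs" until timeout.
func waitForStage(exitStatus <-chan int, jobTimeout time.Duration) error {
	select {
	case code := <-exitStatus:
		if code != 0 {
			return fmt.Errorf("stage failed with exit code %d", code)
		}
		return nil
	case <-time.After(jobTimeout):
		return errors.New("timed out waiting for stage exit status")
	}
}

func main() {
	exitStatus := make(chan int, 1)

	// Simulate the lost-notification case: the script finished in the pod,
	// but the exit-code marker never reaches this channel.
	// (Uncomment the next line to simulate the happy path.)
	// exitStatus <- 0

	err := waitForStage(exitStatus, 3*time.Second) // short timeout for the demo
	fmt.Println("result:", err)
}
```

With the send missing, the only exit path is the timeout branch, which matches the observed behavior of the job ending only when the job timeout fires.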