Job steps time out after 10 minutes, then the job gets stuck and fails
Status update:
- The root cause was an issue with the DigitalOcean network. Refer to this comment.
Overview
TL;DR - various job steps run for exactly 10:00 (ten minutes) without any output, then the following step hangs until the 1h job timeout occurs.
GitLab CE 15.8.0, installed via the Helm chart; cloud provider is DigitalOcean (DOKS). All data is stored in S3-compatible storage (DigitalOcean Spaces).
We have been facing this issue since 15.5.2. Before that, we ran GitLab CE 14.x on a separate VM with local storage and had no issues.
Hello! I can't find anything similar to my issue. As stated above, some steps (it can be almost any step) run for 10:00 without any output, and then the next step hangs until the global job timeout occurs.
My first thought was that it is related to S3 storage, since most jobs hang on the following steps (all of them work with S3):
- Restoring cache
- Saving cache for successful job
- Getting source from Git repository
But there are also a bunch of jobs that hang on a step which does not interact with S3 at all:
- Executing "step_script" stage of the job script
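One way to check the storage hypothesis independently of the runner is to time a plain TCP connection to the storage endpoint from inside a job or pod. The following is only a minimal sketch, assuming Python is available in the job image; `probe_tcp` is a hypothetical helper, not part of GitLab or gitlab-runner, and the host name to pass is whatever endpoint your cache configuration points at.

```python
import socket
import time


def probe_tcp(host: str, port: int = 443, timeout: float = 5.0) -> tuple[bool, float]:
    """Attempt a single TCP connection.

    Returns (succeeded, seconds_elapsed). A failure that takes the full
    timeout suggests dropped packets rather than a refused connection.
    """
    started = time.monotonic()
    try:
        # create_connection resolves the name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            ok = True
    except OSError:
        # Covers refused connections, timeouts and DNS resolution errors.
        ok = False
    return ok, time.monotonic() - started
```

Calling it in a loop (for example once a minute) and logging the results next to the job timestamps would show whether connection failures line up with the 10:00 stalls.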
Enabling CI_DEBUG_TRACE=true for the jobs didn't show anything:
++ echo '[32;1mSkipping Git submodules setup[0;m'
[32;1mSkipping Git submodules setup[0;m
+ exit 0
+ runner_script_trap
+ exit_code=0
+ out_json='{"command_exit_code": 0, "script": "/scripts-12-152328/get_sources"}'
+ echo ''
+ echo '{"command_exit_code": 0, "script": "/scripts-12-152328/get_sources"}'
section_end:1673942876:get_sources
[0K+ exit 0
section_start:1673942876:step_script
[0K[0K[36;1mExecuting "step_script" stage of the job script[0;m[0;m
section_end:1673943476:step_script
[0Ksection_start:1673943476:cleanup_file_variables
[0K[0K[36;1mCleaning up project directory and file based variables[0;m[0;m
+ set -o
+ grep pipefail
+ set -o pipefail
+ set -o errexit
+ set +o noclobber
+ :
+ eval '$'\''rm'\'' -f /builds/fxg/frontend.tmp/CI_SERVER_TLS_CA_FILE
'
++ rm -f /builds/fxg/frontend.tmp/CI_SERVER_TLS_CA_FILE
+ exit 0
+ runner_script_trap
+ exit_code=0
+ out_json='{"command_exit_code": 0, "script": "/scripts-12-152328/cleanup_file_variables"}'
+ echo ''
+ echo '{"command_exit_code": 0, "script": "/scripts-12-152328/cleanup_file_variables"}'
+ exit 0
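The `section_start` / `section_end` markers in the trace above carry Unix timestamps, so the length of each step can be computed directly from a saved raw job log. A minimal sketch, assuming the marker format seen in this trace (`section_start:<timestamp>:<name>`); the function name `section_durations` is made up for illustration.

```python
import re

# Matches the section markers seen in the trace,
# e.g. "section_start:1673942876:step_script".
SECTION_RE = re.compile(r"section_(start|end):(\d+):([A-Za-z0-9_.-]+)")


def section_durations(trace: str) -> dict[str, int]:
    """Return {section_name: seconds} for every section with both markers."""
    starts: dict[str, int] = {}
    durations: dict[str, int] = {}
    for kind, timestamp, name in SECTION_RE.findall(trace):
        if kind == "start":
            starts[name] = int(timestamp)
        elif name in starts:
            # An end marker without a recorded start is ignored.
            durations[name] = int(timestamp) - starts.pop(name)
    return durations


# Markers copied from the trace in this report.
sample = """\
section_end:1673942876:get_sources
section_start:1673942876:step_script
section_end:1673943476:step_script
section_start:1673943476:cleanup_file_variables
"""

for name, seconds in section_durations(sample).items():
    flag = "  <-- exactly 10 minutes" if seconds == 600 else ""
    print(f"{name}: {seconds}s{flag}")
```

For the markers above, `step_script` starts at 1673942876 and ends at 1673943476, which is exactly 600 seconds with no script output in between.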
Enabling gitlab-runner debug logging didn't show any issues either, and I couldn't find any errors in the GitLab pods. This issue has been happening a few times every day for two months. Please help track it down.