Final update lost when job completes while GitLab server is restarting
Summary
I have encountered an issue where GitLab Runner discards the final update for a job (leaving it stuck in the pipeline as “running”) if the GitLab server is restarting (and therefore unavailable) when the job finishes.
Steps to reproduce
- Run a CI job that will finish during step 2 (sleep 10 will do).
- Upgrade GitLab Omnibus, thus causing several minutes of downtime while migrations etc. run.
- Observe that job finished, but GitLab still thinks it’s running. See also logs below.
Actual behavior
Job is stuck “running” because GitLab Runner has given up reporting the status; it will eventually time out and be marked as failed.
Expected behavior
GitLab Runner waits until the server comes back and then correctly reports the job status.
This used to work correctly; for example, I went back and found some jobs which were running on gitlab-runner 16.7.0 (102c81ba) during some downtime and successfully reported their completion once the server came back up.
Relevant logs and/or screenshots
runner log
Checking for jobs... received job=10815 repo_url=https://gitlab.example.com/wolf/test-project.git runner=Vf4CL5e2
Added job to processing list builds=1 job=10815 max_builds=1 project=3 repo_url=https://gitlab.example.com/wolf/test-project.git time_in_queue_seconds=3
Appending trace to coordinator...ok code=202 job=10815 job-log=0-1439 job-status=running runner=Vf4CL5e2 sent-log=0-1438 status=202 Accepted update-interval=1m0s
Job succeeded duration_s=12.658687143 job=10815 project=3 runner=Vf4CL5e2
ERROR: Appending trace to coordinator... error couldn't execute PATCH against https://gitlab.example.com/api/v4/jobs/10815/trace?debug_trace=false: Patch "https://gitlab.example.com/api/v4/jobs/10815/trace?debug_trace=false": dial tcp 10.1.1.10:443: connect: connection refused runner=Vf4CL5e2
ERROR: Appending trace to coordinator... error couldn't execute PATCH against https://gitlab.example.com/api/v4/jobs/10815/trace?debug_trace=false: Patch "https://gitlab.example.com/api/v4/jobs/10815/trace?debug_trace=false": dial tcp 10.1.1.10:443: connect: connection refused runner=Vf4CL5e2
ERROR: Appending trace to coordinator... error couldn't execute PATCH against https://gitlab.example.com/api/v4/jobs/10815/trace?debug_trace=false: Patch "https://gitlab.example.com/api/v4/jobs/10815/trace?debug_trace=false": dial tcp 10.1.1.10:443: connect: connection refused runner=Vf4CL5e2
ERROR: Appending trace to coordinator... error couldn't execute PATCH against https://gitlab.example.com/api/v4/jobs/10815/trace?debug_trace=false: Patch "https://gitlab.example.com/api/v4/jobs/10815/trace?debug_trace=false": dial tcp 10.1.1.10:443: connect: connection refused runner=Vf4CL5e2
ERROR: Appending trace to coordinator... error couldn't execute PATCH against https://gitlab.example.com/api/v4/jobs/10815/trace?debug_trace=false: Patch "https://gitlab.example.com/api/v4/jobs/10815/trace?debug_trace=false": dial tcp 10.1.1.10:443: connect: connection refused runner=Vf4CL5e2
ERROR: Appending trace to coordinator... error couldn't execute PATCH against https://gitlab.example.com/api/v4/jobs/10815/trace?debug_trace=false: Patch "https://gitlab.example.com/api/v4/jobs/10815/trace?debug_trace=false": dial tcp 10.1.1.10:443: connect: connection refused runner=Vf4CL5e2
ERROR: Appending trace to coordinator... error couldn't execute PATCH against https://gitlab.example.com/api/v4/jobs/10815/trace?debug_trace=false: Patch "https://gitlab.example.com/api/v4/jobs/10815/trace?debug_trace=false": dial tcp 10.1.1.10:443: connect: connection refused runner=Vf4CL5e2
ERROR: Appending trace to coordinator... error couldn't execute PATCH against https://gitlab.example.com/api/v4/jobs/10815/trace?debug_trace=false: Patch "https://gitlab.example.com/api/v4/jobs/10815/trace?debug_trace=false": dial tcp 10.1.1.10:443: connect: connection refused runner=Vf4CL5e2
ERROR: Appending trace to coordinator... error couldn't execute PATCH against https://gitlab.example.com/api/v4/jobs/10815/trace?debug_trace=false: Patch "https://gitlab.example.com/api/v4/jobs/10815/trace?debug_trace=false": dial tcp 10.1.1.10:443: connect: connection refused runner=Vf4CL5e2
ERROR: Appending trace to coordinator... error couldn't execute PATCH against https://gitlab.example.com/api/v4/jobs/10815/trace?debug_trace=false: Patch "https://gitlab.example.com/api/v4/jobs/10815/trace?debug_trace=false": dial tcp 10.1.1.10:443: connect: connection refused runner=Vf4CL5e2
ERROR: Job trace termination "Success" failed duration_s=12.658687143 error=received invalid patch trace response job=10815 project=3 runner=Vf4CL5e2
Removed job from processing list builds=0 job=10815 max_builds=1 project=3 repo_url=https://gitlab.example.com/wolf/test-project.git time_in_queue_seconds=3
WARNING: Checking for jobs... failed runner=Vf4CL5e2 status=couldn't execute POST against https://gitlab.example.com/api/v4/jobs/request: Post "https://gitlab.example.com/api/v4/jobs/request": dial tcp 10.1.1.10:443: connect: connection refused
WARNING: Checking for jobs... failed runner=Vf4CL5e2 status=couldn't execute POST against https://gitlab.example.com/api/v4/jobs/request: Post "https://gitlab.example.com/api/v4/jobs/request": dial tcp 10.1.1.10:443: connect: connection refused
Environment description
Self-hosted, GitLab Omnibus + GitLab Runner with shell executor
Used GitLab Runner version
Version: 17.3.1
Git revision: 66269445
Git branch: 17-3-stable
GO version: go1.22.5
Built: 2024-08-21T15:24:26+0000
OS/Arch: linux/amd64
Possible fixes
From the log output I believe this is probably caused by !4692 (merged) “Give up on the trace finalUpdate if it keeps on failing”, introduced by @ratchade in gitlab-runner v16.11.0.
Some ideas that spring to mind:
- Increase defaultFinalUpdateTriesCount from its current value of 10.
  Notably, DefaultTraceFinalizeTimeout (which was introduced by the same MR, but as far as I can tell is dead code) is set to 60 minutes, but the combination of defaultFinalUpdateTriesCount = 10 plus the default retry behavior seems to limit the actual timeout to less than 60 seconds.
- Make defaultFinalUpdateTriesCount configurable so I can increase it myself. (The fact that it’s named default makes me think it is configurable, but I can’t figure out where.)
- Continue retrying the final update indefinitely if the problem is that the entire GitLab server is unavailable; in this situation we know giving up won’t help: we can’t contact the server to get a new job anyway.