Final update lost when job completes while GitLab server is restarting

Summary

GitLab Runner discards the final update for a job (leaving it stuck in the pipeline as “running”) if the GitLab server is restarting, and therefore unavailable, when the job finishes.

Steps to reproduce

  1. Run a CI job that will finish during step 2 (sleep 10 will do).
  2. Upgrade GitLab Omnibus, thus causing several minutes of downtime while migrations etc. run.
  3. Observe that the job finished, but GitLab still thinks it’s running. See also the logs below.

Actual behavior

The job is stuck “running” because GitLab Runner has given up reporting its status; GitLab will eventually time the job out and mark it as failed.

Expected behavior

GitLab Runner waits until the server comes back and then correctly reports the job status.

This used to work correctly; for example, I went back and found some jobs which were running on gitlab-runner 16.7.0 (102c81ba) during some downtime and successfully reported their completion once the server came back up.

Relevant logs and/or screenshots

runner log
Checking for jobs... received                       job=10815 repo_url=https://gitlab.example.com/wolf/test-project.git runner=Vf4CL5e2                                                                  
Added job to processing list                        builds=1 job=10815 max_builds=1 project=3 repo_url=https://gitlab.example.com/wolf/test-project.git time_in_queue_seconds=3
Appending trace to coordinator...ok                 code=202 job=10815 job-log=0-1439 job-status=running runner=Vf4CL5e2 sent-log=0-1438 status=202 Accepted update-interval=1m0s                                
Job succeeded                                       duration_s=12.658687143 job=10815 project=3 runner=Vf4CL5e2
ERROR: Appending trace to coordinator... error couldn't execute PATCH against https://gitlab.example.com/api/v4/jobs/10815/trace?debug_trace=false: Patch "https://gitlab.example.com/api/v4/jobs/10815/trace?debug_trace=false": dial tcp 10.1.1.10:443: connect: connection refused  runner=Vf4CL5e2
ERROR: Appending trace to coordinator... error couldn't execute PATCH against https://gitlab.example.com/api/v4/jobs/10815/trace?debug_trace=false: Patch "https://gitlab.example.com/api/v4/jobs/10815/trace?debug_trace=false": dial tcp 10.1.1.10:443: connect: connection refused  runner=Vf4CL5e2
ERROR: Appending trace to coordinator... error couldn't execute PATCH against https://gitlab.example.com/api/v4/jobs/10815/trace?debug_trace=false: Patch "https://gitlab.example.com/api/v4/jobs/10815/trace?debug_trace=false": dial tcp 10.1.1.10:443: connect: connection refused  runner=Vf4CL5e2
ERROR: Appending trace to coordinator... error couldn't execute PATCH against https://gitlab.example.com/api/v4/jobs/10815/trace?debug_trace=false: Patch "https://gitlab.example.com/api/v4/jobs/10815/trace?debug_trace=false": dial tcp 10.1.1.10:443: connect: connection refused  runner=Vf4CL5e2
ERROR: Appending trace to coordinator... error couldn't execute PATCH against https://gitlab.example.com/api/v4/jobs/10815/trace?debug_trace=false: Patch "https://gitlab.example.com/api/v4/jobs/10815/trace?debug_trace=false": dial tcp 10.1.1.10:443: connect: connection refused  runner=Vf4CL5e2
ERROR: Appending trace to coordinator... error couldn't execute PATCH against https://gitlab.example.com/api/v4/jobs/10815/trace?debug_trace=false: Patch "https://gitlab.example.com/api/v4/jobs/10815/trace?debug_trace=false": dial tcp 10.1.1.10:443: connect: connection refused  runner=Vf4CL5e2
ERROR: Appending trace to coordinator... error couldn't execute PATCH against https://gitlab.example.com/api/v4/jobs/10815/trace?debug_trace=false: Patch "https://gitlab.example.com/api/v4/jobs/10815/trace?debug_trace=false": dial tcp 10.1.1.10:443: connect: connection refused  runner=Vf4CL5e2
ERROR: Appending trace to coordinator... error couldn't execute PATCH against https://gitlab.example.com/api/v4/jobs/10815/trace?debug_trace=false: Patch "https://gitlab.example.com/api/v4/jobs/10815/trace?debug_trace=false": dial tcp 10.1.1.10:443: connect: connection refused  runner=Vf4CL5e2
ERROR: Appending trace to coordinator... error couldn't execute PATCH against https://gitlab.example.com/api/v4/jobs/10815/trace?debug_trace=false: Patch "https://gitlab.example.com/api/v4/jobs/10815/trace?debug_trace=false": dial tcp 10.1.1.10:443: connect: connection refused  runner=Vf4CL5e2
ERROR: Appending trace to coordinator... error couldn't execute PATCH against https://gitlab.example.com/api/v4/jobs/10815/trace?debug_trace=false: Patch "https://gitlab.example.com/api/v4/jobs/10815/trace?debug_trace=false": dial tcp 10.1.1.10:443: connect: connection refused  runner=Vf4CL5e2
ERROR: Job trace termination "Success" failed       duration_s=12.658687143 error=received invalid patch trace response job=10815 project=3 runner=Vf4CL5e2                                                      
Removed job from processing list                    builds=0 job=10815 max_builds=1 project=3 repo_url=https://gitlab.example.com/wolf/test-project.git time_in_queue_seconds=3
WARNING: Checking for jobs... failed                runner=Vf4CL5e2 status=couldn't execute POST against https://gitlab.example.com/api/v4/jobs/request: Post "https://gitlab.example.com/api/v4/jobs/request": dial tcp 10.1.1.10:443: connect: connection refused
WARNING: Checking for jobs... failed                runner=Vf4CL5e2 status=couldn't execute POST against https://gitlab.example.com/api/v4/jobs/request: Post "https://gitlab.example.com/api/v4/jobs/request": dial tcp 10.1.1.10:443: connect: connection refused

Environment description

Self-hosted, GitLab Omnibus + GitLab Runner with shell executor

Used GitLab Runner version

Version:      17.3.1
Git revision: 66269445
Git branch:   17-3-stable
GO version:   go1.22.5
Built:        2024-08-21T15:24:26+0000
OS/Arch:      linux/amd64

Possible fixes

From the log output I believe this is probably caused by !4692 (merged) “Give up on the trace finalUpdate if it keeps on failing”, introduced by @ratchade in gitlab-runner v16.11.0.

Some ideas that spring to mind:

  • Increase defaultFinalUpdateTriesCount from its current value of 10.

    Notably, DefaultTraceFinalizeTimeout (which was introduced by the same MR, but as far as I can tell is dead code) is set to 60 minutes, but the combination of defaultFinalUpdateTriesCount = 10 plus the default retry behavior seems to limit the actual timeout to less than 60 seconds.

  • Make defaultFinalUpdateTriesCount configurable so I can increase it myself. (The fact that it’s named default makes me think it is configurable, but I can’t figure out where.)

  • Continue retrying the final update indefinitely if the problem is that the entire GitLab server is unavailable; in this situation we know giving up won’t help: we can’t contact the server to get a new job anyway.