Artifact upload fails but job succeeds

Summary

Occasionally, after one of our build jobs finishes, it starts uploading the artifacts and fails silently: the job succeeds, but the artifacts are not uploaded and no errors are reported.

  • Jobs that depend on these artifacts then fail, which requires re-running the whole job (as opposed to just re-running the upload)

Steps to reproduce

  • I have managed to reproduce the behavior (the helper crashing but the job succeeding)
    • More details in this project https://gitlab.com/jpsamper/runner-helper-reproducer
    • In production, we usually see it when there are a lot of jobs running/uploading artifacts at the same time
      • We've seen it with as few as 10 jobs uploading 1.5GB zipped (4-5GB unzipped) concurrently
  • Using a custom gitlab-runner-helper with many more print statements, we found that the logs stop after invoking r.client.Do (i.e. if we add a print statement right before and right after the call, the one right after never appears)
    • Naturally, this is only the behavior when something goes wrong; when the artifact is uploaded correctly, we see both print statements

Actual behavior

If I understand correctly, the r.client.Do call mentioned above invokes Do from the net/http package, and that call seems to crash the gitlab-runner-helper without any error message, return code, etc.

Expected behavior

  • If an artifact upload fails, the job fails or the upload is retried
  • Ideally, an informative error message is reported as well

Relevant logs and/or screenshots

job log

When everything works as expected:

Uploading artifacts...
foo: found 15933 matching files and directories 
bar: found 436 matching files and directories 
baz: found 1 matching files and directories 
qux: found 2 matching files and directories                   
Uploading artifacts as "archive" to coordinator... ok  id=1234567 responseStatus=201 Created token=*****                            
Job succeeded

And when it doesn't:

Uploading artifacts...
foo: found 15933 matching files and directories 
bar: found 436 matching files and directories 
baz: found 1 matching files and directories 
qux: found 2 matching files and directories                   
Job succeeded

Environment description

  • GitLab Runner with docker executor kubernetes executor
  • Latest docker version
  • Default config.toml

Used GitLab Runner version

We're currently on gitlab-runner 13.3.0 but have been seeing this since at least 12.9.0

Possible fixes

  • The gitlab-runner-helper could perform a sanity check that the artifacts have actually been uploaded and, if not, try again
    • It could make sense for this check to live in the docker image, so that if the gitlab-runner-helper exits unexpectedly, the sanity check can still run
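
One possible shape for such a check, sketched purely from the job log excerpts above: treat an upload as confirmed only if the "Uploading artifacts..." section is followed by a coordinator confirmation line. `uploadConfirmed` is a hypothetical helper, and the matched strings are assumptions taken from the two logs in this issue, not a documented log format.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// uploadConfirmed reports whether a job log that started an artifact upload
// also contains the coordinator's confirmation line. A log with no upload
// section is treated as having nothing to verify.
func uploadConfirmed(jobLog string) bool {
	started, confirmed := false, false
	sc := bufio.NewScanner(strings.NewReader(jobLog))
	for sc.Scan() {
		line := sc.Text()
		switch {
		case strings.HasPrefix(line, "Uploading artifacts..."):
			// A new upload section begins; it needs its own confirmation.
			started, confirmed = true, false
		case strings.HasPrefix(line, "Uploading artifacts as ") &&
			strings.Contains(line, "responseStatus=201"):
			confirmed = true
		}
	}
	return !started || confirmed
}

func main() {
	good := "Uploading artifacts...\n" +
		"foo: found 15933 matching files and directories\n" +
		"Uploading artifacts as \"archive\" to coordinator... ok  id=1234567 responseStatus=201 Created token=*****\n" +
		"Job succeeded\n"
	bad := "Uploading artifacts...\n" +
		"foo: found 15933 matching files and directories\n" +
		"Job succeeded\n"
	fmt.Println("good log confirmed:", uploadConfirmed(good))
	fmt.Println("bad log confirmed:", uploadConfirmed(bad))
}
```

A check like this would have to run outside the helper process (for example in whatever wraps it) for it to still run when the helper exits unexpectedly.
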
Edited by Juan Pablo Samper