Fix infinite loops on trace updates
What does this MR do?
Implements a maximum-retries check that guards the failure cases in the two loops (here and here) that handle the finalization of job execution.
Why was this MR needed?
Currently, in cases like gitlab-com/gl-infra/production#3441 (closed), when every API call for a particular job responds with 500 (or any other response that Runner treats as a failure), Runner enters an infinite loop while sending the final trace patch or the final job status update.
This should not happen!
The worst-case scenario should be that incomplete information is sent back from Runner to GitLab (which is not the biggest problem for such a job), not a Runner that hangs indefinitely while handling a specific job.
What's the best way to test this MR?
What are the relevant issue numbers?
Fixes #27569
Edited by Tomasz Maczukin