Revert MRs 5531 and 5676
What does this MR do?
Revert "Improve runner build failure reasons" (!5531 - merged) and "Ensure BuildErrors have FailureReason" (!5676 - merged).
Why was this MR needed?
!5676 (merged) and !5531 (merged) had an outcome that was probably not intended.
Jobs that get canceled during some of the predefined steps that were wrapped with asRunnerSystemFailure() are now reported back to GitLab as failure_type=runner_system_failure.
That has a negative effect on our Hosted Runners alerting, where any runner_system_failure is treated as an incident indicator.
It's really hard to reproduce, as there is a lot of concurrency and job cancellation is detected only when sending trace patches. So far, though, we see an elevated number of runner_system_failures that happen in the following way:
- Job is transitioned to running and assigned to the runner.
- Runner starts execution.
- In the meantime, the user force-cancels the job.
- Job is not actively viewed through the UI, so Runner sends patches every 30 seconds. This is the interval at which it is able to detect that the job was canceled.
- Because of that, the job is still being executed by the runner and reaches the get_sources step.
- A Git request is initiated and fails, because on the GitLab side the job is already canceled, so the token is already invalid.
- Failure is detected, job termination is initiated and the RunnerSystemFailure is assigned per https://gitlab.com/gitlab-org/gitlab-runner/blob/3b1118efc6b2f3cfe113c6b03e15ed623607f876/common/build.go#L537
- The final update attempt is made, and at this moment the runner detects that the job was canceled. But that doesn't change the failure reason.
- Job is finished. In GitLab it's already marked as canceled, but now it also gets the failure reason assigned.
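
The sequence above can be sketched as a small simulation. This is only an illustration of the ordering problem, not the Runner's actual code: the names FailureReason, jobResult, getSources and runJob are hypothetical, and the sketch assumes that any error from the wrapped step is turned into a system failure before the cancellation is noticed.

```go
package main

import "errors"

// FailureReason is a hypothetical stand-in for the reason reported to GitLab.
type FailureReason string

const (
	NoFailure           FailureReason = ""
	RunnerSystemFailure FailureReason = "runner_system_failure"
)

// errInvalidToken stands in for the Git request rejected by GitLab
// because the job token was invalidated when the job was canceled.
var errInvalidToken = errors.New("job token is no longer valid")

// jobResult is a hypothetical summary of what the runner ends up reporting.
type jobResult struct {
	FailureReason FailureReason // reason assigned when the step failed
	CanceledSeen  bool          // whether the final update noticed the cancel
}

// getSources simulates the get_sources step: it fails when the job
// has already been canceled on the GitLab side.
func getSources(canceledOnGitLab bool) error {
	if canceledOnGitLab {
		return errInvalidToken
	}
	return nil
}

// runJob replays the sequence: the step fails first, the failure reason is
// assigned, and only the final update learns about the cancellation.
func runJob(canceledOnGitLab bool) jobResult {
	var res jobResult

	if err := getSources(canceledOnGitLab); err != nil {
		// Assumption: the step is wrapped so that any error becomes
		// a system failure, regardless of its real cause.
		res.FailureReason = RunnerSystemFailure
	}

	// Final update: the cancellation is detected here, too late to
	// change the failure reason that was already assigned.
	res.CanceledSeen = canceledOnGitLab

	return res
}
```

Under these assumptions, runJob(true) returns a result with both the runner_system_failure reason and the cancellation flag set, which is the mismatch described above, while runJob(false) reports no failure reason.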
GitLab very likely exposes that in neither the UI nor the API if the state is set to canceled. But:
- the failed jobs metric in Runner with the runner_system_failure label is increased (which in cases like ours, when some threshold is reached, triggers alerting for an infrastructure incident)
- the failure reason is logged in runner logs, giving a false impression of the system state
- in the job it's also marked as "Job failed (system failure):", which confuses the user and is basically a lie: in this case it was not a system failure, but a job cancellation that the Runner did not detect fast enough.