allow_failure:exit_codes prints Scary, Big Red ERROR at tail of log, inciting fear among pipeline users
Problem Description
Jobs which employ allow_failure:exit_codes and which terminate with one of the listed non-zero exit codes produce a final job log message stating:
ERROR: Job failed: exit code N
This scares many a user into believing the job failed, despite the job script exiting with an acceptable exit code.
Several factors contribute to this fear:
- Use of the word ERROR.
- Repeated use of the word failed (in the final job log message, and earlier in the log, as can be seen in the attached example).
- The entire line is printed in very visible RED text.
- It's at the very end of the log, and very prominent when bringing up the job log page.
As a result, two things happen which reduce productivity:
- Users attempt multiple re-runs of the job, wasting their time.
- Users reach out to the pipeline maintainers for a "resolution", wasting even more people's time.
I believe the fundamental problem is that an exit code of non-zero is considered to always be an error, while allow_failure:exit_codes attempts to redefine what constitutes an actual error. Thus, while allow_failure:exit_codes enables the next dependent job in the pipeline to proceed, that is the ONLY effect - exit codes other than zero are still considered and treated as failures. This Issue asks to redefine such results as something other than a failure.
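For reference, a minimal job that triggers the behavior described above might look like the sketch below. The job name and the choice of exit code 2 are illustrative assumptions, not taken from a real pipeline.

```yaml
# Hypothetical job: name and exit code are illustrative only.
check_drift:
  script:
    # The script ends with a non-zero code that the job author considers acceptable.
    - exit 2
  allow_failure:
    exit_codes:
      - 2   # exit code 2 lets the pipeline continue
```

With this configuration the pipeline proceeds, yet the job log still ends with the red ERROR line quoted in the problem description.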
Use cases
In our org, we use the following tactic to give users the opportunity to review changes to an environment before actually making those changes:
- First job produces a plan of changes to an environment/deployment
- 2nd job, a manual one, applies the plan from the first job if the review was acceptable
We use this pair of plan/apply jobs with two separate tools:
- Terraform, for planning and applying infrastructure changes
- Helm, for planning and applying application deployments
In both cases, we would like to have the plan job appear on the pipeline page with a green checkmark if there are no pending changes, but appear different (like with the orange exclamation mark) if there ARE pending changes. Thus, a green checkmark can save time as there is no need to review the plan and no need to run an apply. However, when it's not a green checkmark, users get confused by the Big Red ERROR message and fail to run the necessary "apply" job. They don't even bother to review the planned changes - all they see is the ERROR.
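A rough sketch of the Terraform variant of this plan/apply pair follows. It assumes a runner image that already provides the terraform binary and backend credentials; the job names, stage names, and plan file name are placeholders. It relies on terraform plan -detailed-exitcode, which exits with 0 when there are no changes, 1 on error, and 2 when changes are pending.

```yaml
stages:
  - plan
  - apply

plan:
  stage: plan
  script:
    - terraform init -input=false
    # -detailed-exitcode: 0 = no changes, 1 = error, 2 = changes pending
    - terraform plan -input=false -detailed-exitcode -out=plan.tfplan
  allow_failure:
    exit_codes:
      - 2   # pending changes should show the orange exclamation mark, not block the pipeline
  artifacts:
    when: always   # keep the plan file even when the job exits with 2
    paths:
      - plan.tfplan

apply:
  stage: apply
  when: manual   # run only after someone has reviewed the plan
  needs:
    - job: plan
      artifacts: true
  script:
    - terraform init -input=false
    - terraform apply -input=false plan.tfplan
```

In this sketch, exit code 2 from the plan job is exactly the case where the final log line reads as an error even though the job did what it was meant to do.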
Examples
An example of such a job log:
Proposal
When a job script completes with an expected exit code, the Scary Big Red ERROR message should be replaced with a more benign one indicating that all is not lost and that the next job in the pipeline is okay to proceed. That alone would solve this problem.
Even better would be to expand the list of exit_codes to a map of integers:strings, where each exit_code can provide its own message to print in the log.
Some examples (without custom exit_code strings):
<orange>WARNING: Job completed with acceptable exit code N</orange>
<orange>WARNING: Job completed, exit code N. Check log for details.</orange>
With custom exit_code strings:
<orange>WARNING: Job completed, exit code N: [msg]</orange>
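One possible shape for the proposed map of exit codes to messages is sketched below. This syntax is purely hypothetical: GitLab CI does not accept it today, and the job name and message text are made up for illustration.

```yaml
# HYPOTHETICAL syntax for this proposal; not supported by GitLab CI today.
plan:
  script:
    - terraform plan -detailed-exitcode
  allow_failure:
    exit_codes:
      # integer exit code mapped to the message printed at the end of the job log
      2: "Plan contains pending changes. Review the plan, then run the apply job."
```

Under this sketch, the final log line for exit code 2 would use the custom string in place of [msg] in the WARNING format shown above.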
