When job B depends on job A with the needs keyword, and job A is marked as allow_failure, job B is run even if job A failed.
Contrary to #31673 (closed), I can't see any obvious workaround to this issue.
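A minimal sketch that reproduces the described behavior (job and stage names are placeholders):

```yaml
stages: [ build, test ]

job_a:
  stage: build
  script: exit 1          # job A fails
  allow_failure: true     # ...but the pipeline is allowed to continue

job_b:
  stage: test
  needs: [ job_a ]
  script: echo "job B still runs, even though job A failed"
```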
It seems one could argue that it's bad wording rather than bad interaction, i.e. that needs:, as interpreted by the runner, is actually a directly_after:.
I support your "expected correct behavior", but it seems there's interference with allow_failure implicitly being set to true when a job has when: manual set (see e.g. !25823 (closed) for the apparently expected behavior of various combinations).
Maybe a solution would be to add a strict keyword (or, to keep it consistent, an allow_failure keyword) to needs:?
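For illustration, a sketch of the when: manual interaction mentioned above, with the proposed keyword shown only as a comment; job names and commands are placeholders, and the commented-out syntax does not exist in GitLab CI today:

```yaml
# when: manual implicitly sets allow_failure: true (kept for compatibility with
# the old non-blocking manual jobs), which interacts with the behavior discussed here:
deploy:
  stage: deploy
  when: manual
  script: ./deploy.sh       # placeholder command
  # allow_failure: false    # must be set explicitly to make this manual job blocking

# Hypothetical sketch of the proposal above; not valid GitLab CI syntax today:
# verify:
#   stage: deploy
#   script: ./verify.sh     # placeholder command
#   needs:
#     - job: deploy
#       allow_failure: false   # i.e. only run verify if deploy actually succeeded
```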
Hello @dhershkovitch, a customer has opened a ticket (internal link) about this issue as well. Do you know if the milestone will be updated? Thank you!
We would also like to have this fixed. Use case: if e.g. a security scan job succeeds, then another job that needs it updates the binary attestation. But we do not want the whole pipeline to fail, so for the scan job we set allow_failure to true. Now the binary attestation is updated even if the scan fails.
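A rough sketch of that setup, with placeholder job names and commands:

```yaml
security-scan:
  stage: test
  allow_failure: true       # a failed scan must not fail the whole pipeline
  script: ./run-security-scan.sh            # placeholder

update-attestation:
  stage: deploy
  needs: [ security-scan ]
  script: ./update-binary-attestation.sh    # placeholder
  # Problem: with the current behavior, this job runs and the attestation is
  # updated even when security-scan failed.
```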
@dhershkovitch @marknuzzo This is indeed the expected behavior, but I think there is a feature request here. Here's the thread: #32598 (comment 1323490522). The user explained their use case with an example. I am not sure if we have other issues like this, but it seems this could be a common use case. I also wonder if our Engineering Productivity team uses this kind of workflow for our pipelines. Maybe there is an existing solution, or we should come up with one.
> When job B depends on job A with the needs keyword, and job A is marked as allow_failure, job B is run even if job A failed.
@marknuzzo I am not aware that we have defined any CI config that matches this description. I did a quick search in the gitlab repo for all the jobs with an allow_failure: true rule and did my best to track down whether any of them are needed by a different job, but the search came up empty. However, we have so many jobs that are allowed to fail that I may have missed some.
That said, I wonder if you can try something like this:
Job A can upload an artifact with its exit status, or some indication of its success/failure status, have Job B pull this artifact (done automatically with the needs keyword, I believe), and read the value. Only proceed if the value is a passing status. Would that work?
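A minimal sketch of that idea; job names, stages, and scripts are placeholders:

```yaml
job_a:
  stage: build
  allow_failure: true
  script:
    # Record the exit status in a file, then still exit with it so the job's
    # own result is reported correctly.
    - ./do_work.sh; rc=$?; echo "$rc" > status.txt; exit "$rc"   # ./do_work.sh is a placeholder
  artifacts:
    when: always            # upload the status file even when the job fails
    paths: [ status.txt ]

job_b:
  stage: test
  needs: [ job_a ]          # also downloads job_a's artifacts
  script:
    - if [ "$(cat status.txt)" != "0" ]; then echo "job_a failed, stopping early"; exit 0; fi
    - echo "job_a passed, proceeding"
```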
If I may, this is what we tried down below at #32598 (comment 1330436265) (read from the part "> Have you considered using [...]").
In short: yes it works, but in terms of usability it's not great, because if you « only proceed if the value is a passing status », in practice that means exiting 0 early, and so the job that you "skipped" is marked as successful. It's not equivalent to a proper "skip" that would appear in yellow in the pipeline.
Without trying to search around and see how we use allow_failure along with needs, I think I can still say that I hope we don't change the behaviour in any way, even though I do agree this might be a weird interaction, or at least ambiguous and unclear what should happen.
I hope we don't change the behaviour because we do use both extensively, along with when: on_failure and so on. A lot of things were done by trial and error. I would prefer that we add constructs that express the intention more clearly, rather than change this. We can deprecate it, and then follow up and update configurations where we're relying on it with the new recommended constructs. Relying on exit codes is a great idea.
Edited: when: manual also plays a role here, in that allow_failure: true is added to maintain compatibility with the old manual jobs, which are non-blocking.
If it is the expected behavior, then it would be helpful to provide an alternative way of achieving the behavior that many users were expecting instead. For instance, by following the suggestion at #32598 (comment 296101771).
test1 must run IMO. build1 has allow_failure set, so when it fails, the pipeline keeps going. So test1 must run, it seems obvious, unless I'm missing something?
test2 is skipped? run?
test2 should NOT run. But right now it does, and now that it shipped this way for a while (3 years?), it can't be changed. But it should be possible to make it more flexible.
Let me try to elaborate. allow_failure is useful because we don't want build1 to fail the pipeline. But that doesn't mean that we don't care about the outcome of build1. We care, and we want to be able to make decisions based on this outcome. So there should be a way that jobX runs regardless of the outcome of build1 (and that's the current behavior with needs), and then jobY should run only if build1 succeeded, and why not have a jobZ that runs only if build1 failed.
Maybe all of this is already possible, and if so, I'd be happy to know how. Thanks!
@furkanayhan Sure! Here it is (I used real names instead of X, Y, Z for clarity):
```yaml
image: debian:sid

stages:
  - check
  - build
  - test

# pre-build checks
check-i386:
  stage: check
  script: exit 0   # yes, we can build for i386 architecture
  allow_failure:
    exit_codes: 77

check-arm64:
  stage: check
  script: exit 77  # no, arm64 builds are not supported
  allow_failure:
    exit_codes: 77

# builds
build:
  stage: build
  script: echo main build

build-i386:
  stage: build
  script: echo build i386
  needs: [ check-i386 ]

build-arm64:
  stage: build
  script: echo build arm64
  needs: [ check-arm64 ]

# tests
test:
  stage: test
  script: echo test
```
Now let me explain that with plain words:
I want to build and test my program for a "main architecture", and those are the jobs build and test. If any of those fail, the pipeline fails.
Then I have extra architectures, i386 and arm64. For each of those, I want to:

- check whether my program can build for this architecture (I don't have this information when I write the YAML, so I need to run a script to know that). These are the check-* jobs.
- if it can build for this architecture, build it. These are the build-* jobs.
Finally, I'm not going to run tests for extra archs. Running tests for the main architecture only is enough.
Picture:
And looking at the picture above: I wish that build-arm64 was skipped, based on the fact that check-arm64 previously failed.
@arnaudr Thanks for the example, now I see your use case. However, I am not sure if using the failed status is a good idea for this, because the check-* jobs did not actually fail, they just finished with a result, right? Have you considered using artifacts for this? In the check-* jobs, you can create artifacts based on the result. Then in the build-* jobs, you can read the result and exit the build immediately.
Maybe we should have a feature of allow_failure like allow_failure:only_for_pipeline:true. Or maybe we should think about this kind of use case and come up with a better solution. We'll discuss this.
> However, I am not sure if using the failed status is a good idea for this, because the check-* jobs did not actually fail, they just finished with a result, right?
Correct, the check-arm64 job finishes with 77 in order to say "please skip the build for the arm64 architecture". To be exhaustive, the idea is that the various check-* scripts return the following (a sketch follows the list):
- 0: indicates that I want to build the program
- 77: indicates that I want to skip the build
- anything else: an error happened
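A sketch of what such a check job could look like, following the stage layout of the example above; ./can-build-for is a hypothetical helper standing in for the real check:

```yaml
check-arm64:
  stage: check
  script:
    - |
      # Convention: 0 = please build, 77 = please skip, anything else = real error.
      # The if-wrapper keeps the runner's errexit from aborting before we capture rc.
      if ./can-build-for arm64; then rc=0; else rc=$?; fi
      exit "$rc"
  allow_failure:
    exit_codes: 77
```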
But GitLab CI doesn't really allow me to make decisions based on a job's exit code. So the idea with allow_failure: 77 was to work around that, but it doesn't really fly: as discussed here, build-arm64 will run anyway, even if check-arm64 failed before (due to allow_failure). And from build-arm64, I don't have access to the exit code of check-arm64 anyway...
So, it's a bit clunky indeed.
> Have you considered using artifacts for this? In the check-* jobs, you can create artifacts based on the result. Then in the build-* jobs, you can read the result and exit the build immediately.
I just tried, and it is indeed a better approach. Here's the YAML I used:
```yaml
image: debian:sid

stages:
  - check
  - build
  - test

# pre-build checks
check-i386:
  stage: check
  script: touch .please-build
  artifacts:
    paths: [ .please-* ]

check-arm64:
  stage: check
  script: touch .please-skip
  artifacts:
    paths: [ .please-* ]

# builds
build:
  stage: build
  script: echo main build

build-i386:
  stage: build
  dependencies: [ check-i386 ]
  script:
    - if [ -e .please-skip ]; then echo skipping; exit 0; fi
    - if [ -e .please-build ]; then echo building; fi

build-arm64:
  stage: build
  dependencies: [ check-arm64 ]
  script:
    - if [ -e .please-skip ]; then echo skipping; exit 0; fi
    - if [ -e .please-build ]; then echo building; fi

# tests
test:
  stage: test
  script: echo test
```
However, and as you can see above, there's one major downside: there is absolutely no indication of whether a job was actually skipped or not. I have to check the logs to see that for the green build-i386, there was a successful build, while for the green build-arm64, in fact nothing was built.
So, using artifacts this way is still, IMHO, a workaround.
I have two suggestions for improving that:
1. rules + artifacts
The rules keyword could check whether an artifact exists (or doesn't exist) in order to skip a job. At the moment there is rules:exists, but it only works for files in the Git repo. So maybe extend rules:exists to cover artifacts, or add a new rules:artifacts keyword. That would be useful for this situation.
This kind of construct would only check if files exist (not read files and check their content). But that can already be quite powerful.
2. rules + exit code

Alternatively, there could be a rules:job keyword, to decide what to do based on the exit code of other jobs. This way, there would be no need to create artifacts.
Although, with this kind of feature, I would still need allow_failure, because AFAIK a job can only succeed by returning 0; everything else is considered a failure by GitLab. Does it feel like abusing the allow_failure keyword? If so, maybe introduce a new keyword consider_success to list non-zero exit codes that should be considered a success. But maybe that's going a bit too far.
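To make the two suggestions concrete, here is a sketch using purely hypothetical syntax; none of rules:artifacts, rules:job or consider_success exists in GitLab CI today:

```yaml
# Suggestion 1: let rules check for an artifact produced by a needed job.
build-arm64:
  stage: build
  needs: [ check-arm64 ]
  script: echo build arm64
  rules:
    - artifacts: [ .please-build ]   # hypothetical: run only if this artifact exists

# Suggestion 2: let rules look at another job's exit code directly.
# build-arm64:
#   rules:
#     - job: check-arm64
#       exit_codes: [ 0 ]            # hypothetical: run only if check-arm64 exited 0
```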
Thank you for your comment. We are currently discussing it internally. And as you said, there is no workaround other than using artifacts (or dotenv variables).
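For reference, a sketch of the dotenv variant, reusing the job names from the earlier example; artifacts:reports:dotenv exists today, and its variables are available in jobs that list the producer in needs:

```yaml
check-arm64:
  stage: check
  script:
    - echo "SKIP_BUILD_ARM64=true" > check.env   # decision hard-coded here as a placeholder
  artifacts:
    reports:
      dotenv: check.env

build-arm64:
  stage: build
  needs: [ check-arm64 ]
  script:
    - if [ "$SKIP_BUILD_ARM64" = "true" ]; then echo skipping; exit 0; fi
    - echo build arm64
```

Note that this shares the downside discussed above: build-arm64 still runs and shows up green even when nothing was built.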
I think one important detail with this proposal is to decide what happens if there's already a variable SKIP_BUILD_ARM64 defined in the YAML. Does it take precedence over what might have been set in the dotenv file? I'd say yes, IMO anything user-defined must take precedence over the rest.
"Dotenv variables/dependency variables" overrides all YAML variables, so we'd be good to go if we had this feature.
I just wanted to clarify this point.
I think I was thinking exactly the opposite: variables defined in the YAML, or defined in the pipeline settings, or defined by the user for a manual run, should always take precedence over what's in the dotenv file. Because if the user says FOO=bar, and then down the line some job writes FOO=xyz in a file, who's right? IMO the user is right; code shouldn't be allowed to overwrite a variable that was already clearly set by the user.
But that's just my perspective, what seems logical to me from where I come from. In any case, if the behavior "Dotenv variables/dependency variables override all YAML variables" is already what's out there, and clearly documented, then so be it; there's nothing to discuss.
> I think I was thinking exactly the opposite: variables defined in the YAML, or defined in the pipeline settings, or defined by the user for a manual run, should always take precedence over what's in the dotenv file.
Not only are "workarounds" using artifacts:reports:dotenv inappropriate for most use cases (as the dependent job still runs, and needs to be modified to know when to exit early), but for uses like this example where another project is triggered, it would be entirely wrong to require a change to the pipeline definition of other-group/other-project to support this.
This issue was originally raised as a bug. Should a Feature Proposal be created for the new functionality?