This issue was created from an older one because the original had too many diverse discussions. From the latest comment thread we came up with action points, so we created this new issue to focus the discussion.
Problem
Users can set `allow_failure` in security jobs, which means "allow the job to fail": a failed job does not cause the pipeline to fail.
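For reference, this is a minimal sketch of how `allow_failure` appears in a `.gitlab-ci.yml`; the job name and script are illustrative, not from an actual template:

```yaml
# Hypothetical security job, shown only to illustrate allow_failure.
sast_example:
  stage: test
  script:
    - /analyzer run       # placeholder scanner invocation
  allow_failure: true     # if this job fails, the pipeline still passes (with a warning)
```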
Current behaviours
Research result:
A green checkmark doesn't communicate at all that vulns were found
The yellow ! mark communicates best that vulns were found
The red X mark also communicates that vulns were found
Key Problems:
By default `allow_failure` is not set, which triggers either a red X or a green checkmark; neither is ideal
Our tooltip message is not clear enough to explain why the job failed
Our tooltip message is not consistent for the same status icon shown on different pages
@stkerr Please correct me if I am wrong. This issue will ONLY be updated for coverage/API fuzz testing. We hope all scanners can reach agreement for consistency, but we also understand that different scanners will have different concerns.
Camellia X Yang changed title from "Update job with corresponding exit code with correct job icons and pipeline behaviors" to "Update secure job status with corresponding exit code with correct icons"
@cam.x I think we're grouping a few different problems together that we could solve individually. Some will have boring solutions and others may require ones that are a bit more involved. From my perspective, we could break them into:
Problem 1: Users incorrectly believe that a green check mark in the pipeline means the job passed and found no vulns.
Boring solution
Update jobs to `allow_failure: true` in the CI template and then start failing the jobs if they find any vulns.
Benefit: Behavior would more closely match the user expectation that we uncovered in the research issue. Communicates additional information without introducing friction. Requires changing only our CI templates and no new development. This is also what we already do for coverage fuzz testing, so it wouldn't be a big change to make for API fuzz testing as well.
Limitations: No information given beyond "something happened" in the job status.
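The boring solution could look roughly like this in a CI template. This is a sketch, not the actual template: the job name is invented, and the report check at the end is a hypothetical stand-in for however the scanner decides it found vulns:

```yaml
# Sketch of the "boring solution": keep allow_failure: true in the template,
# but make the job exit non-zero when the scanner reports vulnerabilities,
# so the pipeline shows the yellow ! instead of a plain green checkmark.
apifuzz_example:
  stage: fuzz
  script:
    - /fuzzer run                                # placeholder scanner invocation
    - if [ -s gl-api-fuzzing-report.json ]; then exit 1; fi   # hypothetical vuln check
  allow_failure: true
```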
Problem 2: Users don't understand whether vulnerabilities are new from their own code changes or pre-existing in the repo.
Solution
Add logic to determine the new/existing status of reported vulnerabilities. Provide a different exit code for each scenario and display a new icon accordingly. Would use the new exit code feature (e.g. `allow_failure: 1,3,42`).
Benefit: Users can quickly see more about what failed in the security job and why.
Limitations: Technically complex to determine status. New development required to add updated icons and pipeline states. Exit codes not in the allow_failure list would fail the entire pipeline, which is not desirable.
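The exit-code variant could build on GitLab's `allow_failure:exit_codes` keyword. The keyword is real; the specific codes, their meanings, and the job name below are invented for illustration:

```yaml
# Sketch using allow_failure:exit_codes. Hypothetical convention:
#   exit 0 -> no vulns, job passes (green)
#   exit 1 -> pre-existing vulns only, job fails but is allowed (yellow !)
#   exit 3 -> new vulns, job fails but is allowed (yellow !)
#   any other code -> real failure, fails the pipeline (red X)
secure_scan_example:
  stage: test
  script:
    - /analyzer run     # placeholder scanner invocation
  allow_failure:
    exit_codes: [1, 3]
```

Note the limitation mentioned above: any exit code not in the list fails the whole pipeline.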
For fuzz testing (and I'd propose other groups too), we should start with addressing problem 1, since that is a very minimal change. We can then tackle the second problem afterwards.
When `allow_failure` is set by the user and the job fails, we show a red X. This doesn't really tell the user clearly why it failed, though, which was because vulnerabilities were found. Research results
This isn't quite right. The Red X only shows up if allow_failure is not present. If it is present, the yellow ! will show up. Here's a picture for coverage fuzz testing where we use allow_failure and the job failed.
Compared to a pipeline where a job failed and allow_failure was not present.
@stkerr I got your point; I mixed the problems from the research and our current behaviours together in the description, which made it confusing. I've now updated the description with your suggestion, please have a look!
Let me summarize a bit, to make sure we are on the same page:
What we all agreed on:
We should tell users in a clear way (= clear pipeline status) when there is a vulnerability found.
The **yellow !** is the best way to communicate that, in order to differentiate from passing/failing for non-security-related reasons.
Current behaviour is shown in the table in the issue description (I summarized your update into a table and added more screenshots, please double-check)
Solution step 1:
Set `allow_failure: true` by default
Update the docs, security part
We need to inform current users about the change
Update the icon and tooltip for `allow_failure`
The tooltip should come from the backend; all icons for the same job and same status should have the same tooltip
Solution Step 2:
We follow the exit code and update the user with more detailed information, as you described in problem 2 in this comment: #300415 (comment 497223719)
We should create a new issue for that, right?
One big remaining question: does `allow_failure: true` mean pass or fail when vulns are found?
Maybe that's also a question for @v_mishra or @nadia_sotnikova. I've read the doc several times and I can't really make up my mind 100%. I actually lean more toward it meaning _passed with a warning_ instead of _failing with a warning_?
After I am clear on this question, I will bring the solution table back: when to show what.
@cam.x `allow_failure: true` means that even if the particular job fails, it will be considered a success and won't stop the jobs in the subsequent stages from running. Does that make it clear?
@v_mishra Thanks for the fast reply. I guess you mean we prefer to emphasise the success part, even though it is a technical failure? Let me ask the question differently: if we need to choose one of the messages to show in the tooltip, _passed with a warning_ or _failing with a warning_, which one would you recommend?
@cam.x Sorry for the confusion!
On further observation I figured out that when `allow_failure: true`, the pipeline status shows _passed with warning_ because the status of the pipeline is a success irrespective of the status of the job, but the job status has to be shown as failed (with warning) because that's the actual status of the job. This isn't a ~bug as I suspected; this is the expected behaviour.
But there's scope to keep the tooltip consistent for the job status on the Jobs page and on the graph in the pipeline overview page.
@v_mishra Ah, OK, got it: pipeline _passed with warning_, job status _failed with warning_. Those are messages people are familiar with, so we shouldn't expect them to be confused by seeing two yellow ! icons with different messages?
@v_mishra Another question: do you know whether we have the technical ability to show a consistent tooltip message for every identical job status? That is, on the pipeline overview page, on the pipeline tab with flows, and in our security widget (please see the new design here from @beckalippert). I would prefer consistent messaging for identical job statuses, and if we update one place, the others should auto-update as well, so that we avoid telling users different things.
For example, the message comes from the backend keyed by job ID or status ID, so we show the same message everywhere?
> Another question: do you know our technical ability? Can we show a consistent tooltip message for every identical job status, which means, on the pipeline overview page, on pipeline tab with flows and one of our security widget
@cam.x I'm still trying to wrap my head around all the possible states. Please let me know if this is the right way to think about the states being considered before trying to fill them in:
job failed due to script/ technical failure
job failed due to allow_failure NOT set
job passed, NO vulns found
job passed, OLD vulns found
job passed, NEW vulns found
job passed, BOTH old and new vulns found
if allow_failure is set (if job fails AND/OR vulns detected, do NOT fail the pipeline)
if allow_failure is NOT set (if job fails AND/OR vulns detected, FAIL the pipeline)
It seems like some users would want allow_failure set on new vulns found but not old vulns detected. Is this true/ has this been validated from what we know?
Note: other possible statuses of a job in the pipeline are as follows: skipped, cancelled, paused, stopped, not found
@beckalippert @cam.x does this mean you're looking to selectively apply `allow_failure: true` to a set of logic combinations, as opposed to how it works today (bypassing the status and considering it a success irrespective of the nature of the failure)?
This should be the default setting, either true or false; I don't think there is a third option. Before, I thought the default `allow_failure` was always `false`, but I found out in my example project that javafuzz behaves differently compared to cpp and go (different coding languages; same as SAST, each language has a different scanner). So I prefer to use `allow_failure: true` or `allow_failure: false` in the table, and then state what our default settings are.
New or old vulnerabilities found
I know that not all scanners can differentiate new or old, so there is a column that says Vulnerabilities found (No differentiation between new/old)
Here I mean a code error or syntax error: something went wrong and the job can't run. This is the simplest situation, I would say; we need to first tell the user there is an error. All the other situations happen only when there are NO such errors.
That's how I see it. Hope I understood it correctly this time.
PS: @stkerr, to explain myself a bit: I just put everything here as an overview, with a future vision that we can differentiate new and old vulns/bugs. It doesn't mean I am pushing to have everything in one go ;)
PS: it is just easier for me to edit in g-sheet than markdown :P
@cam.x Hm, I'm struggling to connect vulnerabilities found with the `allow_failure` variable. In @theoretick's original issue, he proposed adding the ability to fail the pipeline if vulns are detected. If we're proposing letting the user set "if vulnerabilities are detected, then fail the job and fail the pipeline", that makes sense to me. In that case, if vulns found causes the job to fail and allow_failure = false, the pipeline would fail; wouldn't that make the states in the red box here untrue?
It seems like you said something different in this comment, where allow_failure = false would still show the "!" orange icon.
What seems ideal is to let the user determine the states of each (on a project-by-project basis):
For something like SAST we could even let the user specify whether they want this to apply to ALL vulns or only NEW ones.
Thoughts?
I'm not sure if this is something that @sam.white's team is accounting for on their policies page.
Thanks @beckalippert this is very similar to how we are thinking about policies currently in our prototype (be sure to select Scan Results as the policy type). I like the additional options that your mocks suggest (separating pass/fail jobs vs pass/fail pipelines as well as icon customization). We are accounting for most of this and can easily add the other options you suggest to our prototype.
@beckalippert Since you mentioned my previous issue: it was a bit older, so while this used to be true, IMO we can recenter this discussion around a new capability from the ~"group::verify" team, allow_failure exit codes, as discussed in #300183 (closed).
So instead of allow_failure being a simple true or false, it now allows us to be more precise in our failure condition. This is very similar to what we could eventually do with policies but I'm personally viewing this more as a "what should our template defaults be", rather than "what policies should we let our users set". Whether or not policies are configured, the proposed solutions can address default behaviors.
@beckalippert I like your idea of combining it with policy! I hadn't thought it could be a policy; I was only considering combining it with the Security Gate, like in this issue. Now it makes sense to have it as a policy; I will add a comment to that issue.
Before we go forward with the user's choice, we still need to make two points clear:
How we want `allow_failure` to behave with the current security jobs. If we have the policy, we either overwrite the `allow_failure` option or disable `allow_failure` for security-related jobs.
What the default is, as @theoretick mentioned there.
I made an update on the current behaviour after consulting @ypats (Thx :)) cc: @stkerr
We should understand some points:
`allow_failure` is a positive thing when users set it to `true`: it means "if the job fails, we allow it; it's OK, the pipeline will still pass". When it is set to `false`, a job failure is a job failure; we don't allow it, so it behaves as a job failure and the pipeline fails.
`allow_failure` has a precondition to take effect: the job has to fail first. Otherwise, no matter how the user sets `allow_failure`, it has no impact.
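To illustrate the precondition: with a script that always exits 0, the two settings are indistinguishable. The job names are hypothetical:

```yaml
# Both jobs show the same green status, because allow_failure only takes
# effect when the job actually fails (non-zero exit code).
always_green_a:
  script:
    - exit 0
  allow_failure: true

always_green_b:
  script:
    - exit 0
  allow_failure: false
```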
Point 2 confused me a lot:
There is a bug in C for coverage-guided fuzzing: it behaves differently from the other fuzzing jobs, see the screenshot, row 5. It never fails the job, so it always shows "Pass", regardless of whether vulns are found and regardless of whether allow_failure is true or false.
Other scanners behave differently from coverage-guided fuzzing: e.g., SAST never fails the job, regardless of whether vulns are found, so it always shows "Pass", regardless of whether allow_failure is true or false.
You are right in the case: if vulns found causes the job to fail and allow_failure = false, the pipeline would fail and it should be a red X
I updated the screenshots in the description for both current behaviour and future desired behaviours. I hope this finally clarifies things. We can have a sync call if needed.
Camellia X Yang changed the description
Camellia X Yang changed title from "Update secure job status with corresponding exit code with correct icons" to "Design: Update secure job status with corresponding exit code with correct icons"
Note for the issue: fuzz testing is going to fix the bugs mentioned in the current behaviours for now, in a separate issue. So in the future we will have:
`allow_failure: true` (default) - yellow warning !
For all fuzz testing, including all languages for gl-fuzz and API fuzz
Other improvements with exit codes and more detailed tooltip messages may need to be done in the future. Keep this issue open as a reference for potential future changes.
Side note: we tried to approach changing the way we did exit codes during 14.0 for a 15.0 breaking change, but we were unable to solve it across all of Secure and Protect for 15.0, so it will need to wait for 16.0. @yannmull, not directly related to the UI implications, but just a heads up: in many cases (but not all) it has to align with an X.0 release.
@ichiang for Ultimate users we support security policies that will block a merge request when vulnerabilities are found. We provide much more granular criteria here beyond just a pass/fail request. Could you take a look at this feature and see if it will meet the customer's need?
The use case for the customer is that in their current workflow, merge requests may require multiple approvals before they are ready for merging. (This could take maybe 2-3 days and, including weekends, may take 5 days or so.)
When a merge request pipeline is triggered, they are looking for the job to fail in this specific use-case so that no critical/severe vulnerability will be allowed to merge if new vulnerabilities are detected. This is not the case with our workflows as pipelines will still show as green. To the customer, a green pipeline also means that the pipeline is "free of vulnerabilities" and hence should not require additional work to fix.
We understand that this is not the case at GitLab but it is not intuitive to a lot of customers that we are encountering here.
@jonlimr we have found that when failing the pipeline itself, eventually customers run into a situation where they urgently need to merge something in anyway. Maybe their production environment is down, maybe the vulnerability was a false positive.... regardless of the reason, I have repeatedly seen customers who try that approach eventually run into a need to "override" and allow the merge.
> they are looking for the job to fail in this specific use-case so that no critical/severe vulnerability will be allowed to merge if new vulnerabilities are detected.
This underlying use case can still be accomplished with a Security Policy. It is true the pipeline will show as green, but the merge will not be allowed. Because they can choose who is eligible to approve the security policy, they could potentially even restrict this to just one eligible user. To be even more stringent, this eligible user could potentially be a service account with credentials that are restricted in a KMS. This way there would still potentially be a way to override if needed (users could access the credentials in the KMS, log in as the service account user, and approve), but otherwise the policy would prevent merge the rest of the time.
Would this approach meet their need? Generally speaking, we consider failing the pipeline to prevent merge due to vulnerabilities to be an anti-pattern and this is not a behavior that we currently plan to support on our roadmap.
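For context, a scan result policy of the sort described could look roughly like this. The schema is abbreviated and the group name is hypothetical; treat it as a sketch of the idea rather than a copy-paste policy:

```yaml
# Hypothetical scan result policy: require an approval when new
# critical/high vulnerabilities are detected, instead of failing the pipeline.
scan_result_policy:
  - name: Block new critical and high vulns
    enabled: true
    rules:
      - type: scan_finding
        branches: [main]
        scanners: [sast, dependency_scanning]
        vulnerabilities_allowed: 0
        severity_levels: [critical, high]
        vulnerability_states: [newly_detected]
    actions:
      - type: require_approval
        approvals_required: 1
        group_approvers: [security-approvers]   # hypothetical approver group
```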
Also worth mentioning: we have some improvements planned for the near future to update the mergeability widget to make it clearer which checks must pass before the MR can be merged. Perhaps this would help make the reason for blocking the merge more obvious to the developer?