This issue was created from an older one because the original had too many diverse discussions. From the latest comment thread we came up with action points, so we created this new issue to focus the discussion.
Problem
Users can set `allow_failure` in security jobs, which means "allow the job to fail": a failed job does not cause the pipeline to fail.
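For reference, this is a minimal sketch of how `allow_failure` appears in a `.gitlab-ci.yml`; the job name and script are illustrative, not from an actual template:

```yaml
# Hypothetical security job, shown only to illustrate allow_failure.
sast_example:
  stage: test
  script:
    - /analyzer run       # placeholder scanner invocation
  allow_failure: true     # if this job fails, the pipeline still passes (with a warning)
```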
Current behaviours
Research result:
A green checkmark doesn't communicate at all that vulns were found
The yellow ! mark communicates best that vulns were found
The red X mark also communicates that vulns were found
Key Problems:
By default `allow_failure` is not set, which triggers either a red X or a green checkmark; neither is ideal
Our tooltip message is not clear enough to explain why the job failed
Our tooltip message is not consistent for the same status icon shown on different pages
@stkerr Please correct me if I am wrong. This issue will ONLY be updated for coverage/API fuzz testing. We hope all scanners can reach agreement for consistency, but we also understand that different scanners will have different concerns.
Camellia X Yang changed title from "Update job with corresponding exit code with correct job icons and pipeline behaviors" to "Update secure job status with corresponding exit code with correct icons"
@cam.x I think we're grouping a few different problems together that we could solve individually. Some will have boring solutions and others may require ones that are a bit more involved. From my perspective, we could break them into:
Problem 1: Users incorrectly believe that a green check mark in the pipeline means the job passed and found no vulns.
Boring solution
Update jobs to `allow_failure: true` in the CI template and then start failing the jobs if they find any vulns.
Benefit: Behavior would more closely match the user expectation that we uncovered in the research issue. Communicates additional information without introducing friction. Requires changing only our CI templates and no new development. This is also what we already do for coverage fuzz testing, so it wouldn't be a big change to make for API fuzz testing as well.
Limitations: No information given beyond "something happened" in the job status.
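The boring solution could look roughly like this in a CI template. This is a sketch, not the actual template: the job name is invented, and the report check at the end is a hypothetical stand-in for however the scanner decides it found vulns:

```yaml
# Sketch of the "boring solution": keep allow_failure: true in the template,
# but make the job exit non-zero when the scanner reports vulnerabilities,
# so the pipeline shows the yellow ! instead of a plain green checkmark.
apifuzz_example:
  stage: fuzz
  script:
    - /fuzzer run                                # placeholder scanner invocation
    - if [ -s gl-api-fuzzing-report.json ]; then exit 1; fi   # hypothetical vuln check
  allow_failure: true
```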
Problem 2: Users don't understand whether vulnerabilities are new from their own code changes or pre-existing in the repo.
Solution
Add logic to determine the new/existing status of reported vulnerabilities. Provide a different exit code for each scenario and display a new icon accordingly. Would use the new exit code feature (e.g. `allow_failure: 1,3,42`).
Benefit: Users can quickly see more about what failed in the security job and why.
Limitations: Technically complex to determine status. New development required to add updated icons and pipeline states. Exit codes not in the allow_failure list would fail the entire pipeline, which is not desirable.
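The exit-code variant could build on GitLab's `allow_failure:exit_codes` keyword. The keyword is real; the specific codes, their meanings, and the job name below are invented for illustration:

```yaml
# Sketch using allow_failure:exit_codes. Hypothetical convention:
#   exit 0 -> no vulns, job passes (green)
#   exit 1 -> pre-existing vulns only, job fails but is allowed (yellow !)
#   exit 3 -> new vulns, job fails but is allowed (yellow !)
#   any other code -> real failure, fails the pipeline (red X)
secure_scan_example:
  stage: test
  script:
    - /analyzer run     # placeholder scanner invocation
  allow_failure:
    exit_codes: [1, 3]
```

Note the limitation mentioned above: any exit code not in the list fails the whole pipeline.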
For fuzz testing (and I'd propose other groups too), we should start with addressing problem 1, since that is a very minimal change. We can then tackle the second problem afterwards.
When `allow_failure` is set by the user and the job fails, we show a red X. This doesn't really tell the user clearly why it failed, though, which was because vulnerabilities were found. Research results
This isn't quite right. The Red X only shows up if allow_failure is not present. If it is present, the yellow ! will show up. Here's a picture for coverage fuzz testing where we use allow_failure and the job failed.
Compared to a pipeline where a job failed and allow_failure was not present.
@stkerr I got your point; I mixed the problems from the research and our current behaviours together in the description, which made it confusing. I've now updated the description with your suggestion, please have a look!
Let me summarize a bit, to make sure we are on the same page:
What we all agreed on:
We should tell users in a clear way (= clear pipeline status) when there is a vulnerability found.
The **yellow !** is the best way to communicate that, in order to differentiate from passing/failing for non-security-related reasons.
Current behaviour is shown in the table in the issue description (I summarized your update into a table and added more screenshots, please double-check)
Solution step 1:
Set `allow_failure: true` by default
Update the docs, security part
We need to inform current users about the change
Update the icon and tooltip for `allow_failure`
The tooltip should come from the backend; all icons for the same job and same status should have the same tooltip
Solution Step 2:
We follow the exit code and update the user with more detailed information, as you described in problem 2 in this comment: #300415 (comment 497223719)
We should create a new issue for that, right?
One big remaining question: does `allow_failure: true` mean pass or fail when vulns are found?
Maybe that's also a question for @v_mishra or @nadia_sotnikova. I've read the doc several times and I can't really make up my mind 100%. I actually lean more toward it meaning _passed with a warning_ instead of _failing with a warning_?
After I am clear on this question, I will bring the solution table back: when to show what.
@cam.x `allow_failure: true` means that even if the particular job fails, it will be considered a success and won't stop the jobs in the subsequent stages from running. Does that make it clear?
@v_mishra Thanks for the fast reply. I guess you mean we prefer to emphasise the success part, even though it is a technical failure? Let me ask the question differently: if we need to choose one of the messages to show in the tooltip, _passed with a warning_ or _failing with a warning_, which one would you recommend?
@cam.x Sorry for the confusion!
On further observation I figured out that when `allow_failure: true`, the pipeline status shows _passed with warning_ because the status of the pipeline is a success irrespective of the status of the job, but the job status has to be shown as failed (with warning) because that's the actual status of the job. This isn't a ~bug as I suspected; this is the expected behaviour.
But there's scope to keep the tooltip consistent for the job status on the Jobs page and on the graph in the pipeline overview page.
@v_mishra Ah, OK, got it: pipeline _passed with warning_, job status _failed with warning_. Those are messages people are familiar with, so we shouldn't expect them to be confused by seeing two yellow ! icons with different messages?
@v_mishra Another question: do you know whether we have the technical ability to show a consistent tooltip message for every identical job status? That is, on the pipeline overview page, on the pipeline tab with flows, and in our security widget (please see the new design here from @beckalippert). I would prefer consistent messaging for identical job statuses, and if we update one place, the others should auto-update as well, so that we avoid telling users different things.
For example, the message comes from the backend keyed by job ID or status ID, so we show the same message everywhere?
> Another question: do you know our technical ability? Can we show a consistent tooltip message for every identical job status, which means, on the pipeline overview page, on pipeline tab with flows and one of our security widget
@cam.x I'm still trying to wrap my head around all the possible states. Please let me know if this is the right way to think about the states being considered before trying to fill them in:
job failed due to script/ technical failure
job failed due to allow_failure NOT set
job passed, NO vulns found
job passed, OLD vulns found
job passed, NEW vulns found
job passed, BOTH old and new vulns found
if allow_failure is set (if job fails AND/OR vulns detected, do NOT fail the pipeline)
if allow_failure is NOT set (if job fails AND/OR vulns detected, FAIL the pipeline)
It seems like some users would want allow_failure set on new vulns found but not old vulns detected. Is this true/ has this been validated from what we know?
Note: other possible statuses of a job in the pipeline are as follows: skipped, cancelled, paused, stopped, not found
@beckalippert @cam.x does this mean you're looking to selectively apply `allow_failure: true` to a set of logic combinations, as opposed to how it works today (bypassing the status and considering it a success irrespective of the nature of the failure)?
This should be the default setting, either true or false; I don't think there is a third option. Before, I thought the default `allow_failure` was always `false`, but I found out in my example project that javafuzz behaves differently compared to cpp and go (different coding languages; same as SAST, each language has a different scanner). So I prefer to use `allow_failure: true` or `allow_failure: false` in the table, and then state what our default settings are.
New or old vulnerabilities found
I know that not all scanners can differentiate new or old, so there is a column that says Vulnerabilities found (No differentiation between new/old)
Here I mean a code error or syntax error: something went wrong and the job can't run. This is the simplest situation, I would say; we need to first tell the user there is an error. All the other situations happen only when there are NO such errors.
That's how I see it. Hope I understood it correctly this time.
PS: @stkerr, to explain myself a bit: I just put everything here as an overview, with a future vision that we can differentiate new and old vulns/bugs. It doesn't mean I am pushing to have everything in one go ;)
PS: it is just easier for me to edit in g-sheet than markdown :P
@cam.x Hm, I'm struggling to connect vulnerabilities found with the `allow_failure` variable. In @theoretick's original issue, he proposed adding the ability to fail the pipeline if vulns are detected. If we're proposing letting the user set "if vulnerabilities are detected, then fail the job and fail the pipeline", that makes sense to me. In that case, if vulns found causes the job to fail and allow_failure = false, the pipeline would fail; wouldn't that make the states in the red box here untrue?
It seems like you said something different in this comment, where allow_failure = false would still show the "!" orange icon.
What seems ideal is to let the user determine the states of each (on a project-by-project basis):
For something like SAST we could even let the user specify whether they want this to apply to ALL vulns or only NEW ones.
Thoughts?
I'm not sure if this is something that @sam.white's team is accounting for on their policies page.
Thanks @beckalippert this is very similar to how we are thinking about policies currently in our prototype (be sure to select Scan Results as the policy type). I like the additional options that your mocks suggest (separating pass/fail jobs vs pass/fail pipelines as well as icon customization). We are accounting for most of this and can easily add the other options you suggest to our prototype.
@beckalippert Since you mentioned my previous issue: it was a bit older, so while this used to be true, IMO we can recenter this discussion around a new capability from the ~"group::verify" team, allow_failure exit codes, as discussed in #300183 (closed).
So instead of allow_failure being a simple true or false, it now allows us to be more precise in our failure condition. This is very similar to what we could eventually do with policies but I'm personally viewing this more as a "what should our template defaults be", rather than "what policies should we let our users set". Whether or not policies are configured, the proposed solutions can address default behaviors.
@beckalippert I like your idea of combining it with policy! I hadn't thought it could be a policy; I was only considering combining it with the Security Gate, like in this issue. Now it makes sense to have it as a policy; I will add a comment to that issue.
Before we go forward with the user's choice, we still need to make two points clear:
How we want `allow_failure` to behave with the current security jobs. If we have the policy, we either overwrite the `allow_failure` option or disable `allow_failure` for security-related jobs.
What the default is, as @theoretick mentioned there.
I made an update on the current behaviour after consulting @ypats (Thx :)) cc: @stkerr
We should understand some points:
`allow_failure` is a positive thing when users set it to `true`: it means "if the job fails, we allow it; it's OK, the pipeline will still pass". When it is set to `false`, a job failure is a job failure; we don't allow it, so it behaves as a job failure and the pipeline fails.
`allow_failure` has a precondition to take effect: the job has to fail first. Otherwise, no matter how the user sets `allow_failure`, it has no impact.
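To illustrate the precondition: with a script that always exits 0, the two settings are indistinguishable. The job names are hypothetical:

```yaml
# Both jobs show the same green status, because allow_failure only takes
# effect when the job actually fails (non-zero exit code).
always_green_a:
  script:
    - exit 0
  allow_failure: true

always_green_b:
  script:
    - exit 0
  allow_failure: false
```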
Point 2 confused me a lot:
There is a bug in C for coverage-guided fuzzing: it behaves differently from the other fuzzing jobs, see the screenshot, row 5. It never fails the job, so it always shows "Pass", regardless of whether vulns are found and regardless of whether allow_failure is true or false.
Other scanners behave differently from coverage-guided fuzzing: e.g., SAST never fails the job, regardless of whether vulns are found, so it always shows "Pass", regardless of whether allow_failure is true or false.
You are right in the case: if vulns found causes the job to fail and allow_failure = false, the pipeline would fail and it should be a red X
I updated the screenshots in the description for both current behaviour and future desired behaviours. I hope this finally clarifies things. We can have a sync call if needed.
Camellia X Yang changed the description
Camellia X Yang changed title from "Update secure job status with corresponding exit code with correct icons" to "Design: Update secure job status with corresponding exit code with correct icons"
Note for the issue: fuzz testing is going to fix the bugs mentioned in the current behaviours for now, in a separate issue. So in the future we will have:
`allow_failure: true` (default) - yellow warning !
For all fuzz testing, including all languages for gl-fuzz and API fuzz
Other improvements with exit codes and more detailed tooltip messages may need to be done in the future. Keep this issue open as a reference for potential future changes.
Side note: we tried to approach changing the way we did exit codes during 14.0 for a 15.0 breaking change, but we were unable to solve it across all of Secure and Protect for 15.0, so it will need to wait for 16.0. @yannmull, not directly related to the UI implications, but just a heads up: in many cases (but not all) it has to align with an X.0 release.
@ichiang for Ultimate users we support security policies that will block a merge request when vulnerabilities are found. We provide much more granular criteria here beyond just a pass/fail request. Could you take a look at this feature and see if it will meet the customer's need?
The use case for the customer is that in their current workflow, merge requests may require multiple approvals before they are ready for merging. (This could take maybe 2-3 days and, including weekends, may take 5 days or so.)
When a merge request pipeline is triggered, they are looking for the job to fail in this specific use-case so that no critical/severe vulnerability will be allowed to merge if new vulnerabilities are detected. This is not the case with our workflows as pipelines will still show as green. To the customer, a green pipeline also means that the pipeline is "free of vulnerabilities" and hence should not require additional work to fix.
We understand that this is not the case at GitLab but it is not intuitive to a lot of customers that we are encountering here.
@jonlimr we have found that when failing the pipeline itself, eventually customers run into a situation where they urgently need to merge something in anyway. Maybe their production environment is down, maybe the vulnerability was a false positive.... regardless of the reason, I have repeatedly seen customers who try that approach eventually run into a need to "override" and allow the merge.
> they are looking for the job to fail in this specific use-case so that no critical/severe vulnerability will be allowed to merge if new vulnerabilities are detected.
This underlying use case can still be accomplished with a Security Policy. It is true the pipeline will show as green, but the merge will not be allowed. Because they can choose who is eligible to approve the security policy, they could potentially even restrict this to just one eligible user. To be even more stringent, this eligible user could potentially be a service account with credentials that are restricted in a KMS. This way there would still potentially be a way to override if needed (users could access the credentials in the KMS, log in as the service account user, and approve), but otherwise the policy would prevent merge the rest of the time.
Would this approach meet their need? Generally speaking, we consider failing the pipeline to prevent merge due to vulnerabilities to be an anti-pattern and this is not a behavior that we currently plan to support on our roadmap.
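For context, a scan result policy of the sort described could look roughly like this. The schema is abbreviated and the group name is hypothetical; treat it as a sketch of the idea rather than a copy-paste policy:

```yaml
# Hypothetical scan result policy: require an approval when new
# critical/high vulnerabilities are detected, instead of failing the pipeline.
scan_result_policy:
  - name: Block new critical and high vulns
    enabled: true
    rules:
      - type: scan_finding
        branches: [main]
        scanners: [sast, dependency_scanning]
        vulnerabilities_allowed: 0
        severity_levels: [critical, high]
        vulnerability_states: [newly_detected]
    actions:
      - type: require_approval
        approvals_required: 1
        group_approvers: [security-approvers]   # hypothetical approver group
```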
Also worth mentioning: we have some improvements planned for the near future to update the mergeability widget to make it clearer which checks must pass before the MR can be merged. Perhaps this would help make the reason for blocking the merge more obvious to the developer?