Deduplicating or Remapping SAST findings with the same fingerprint

Problem to solve

As we explore Semgrep analyzers and rules more fully, we will create new analyzers that effectively duplicate what we have today. We should find a way to deduplicate their findings so that we do not create two findings (and therefore two vulnerabilities) when more than one analyzer reports the same vulnerability while scanning projects of a given language.

This is a common issue across devops::secure, since other teams also deal with updating or creating vulnerability_occurrences that map to the same flaw identified by different scanners:

SAST: bandit -> semgrep (and beyond)
DS: bundler-audit -> gemnasium
DAST: zap -> browserker
CS: klar -> trivy

Further Details

As part of &5245 we are considering replacing analyzers/bandit with analyzers/semgrep, using a duplicate ruleset. Following our current processes, this will result in a new scanner that returns the same findings with different identifiers (identifiers[0].type: "semgrep", not identifiers[0].type: "bandit").

There is concern this will duplicate DB records and break all existing vulnerability -> vulnerability_occurrences mappings.

The naive solution is to simply duplicate findings and rely on users to handle deduplication, but if possible we should attempt to preserve the relationship between our returned findings and the DB records for auditing and data integrity.

Should we be concerned about duplicating findings? (same location, likely different data/descriptions)
Are findings with an identical location but a different scanner duplicates?
Is there a flexible way we can remap findings?
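
To make the second question concrete, here is a minimal sketch of one possible duplicate check. It assumes that two findings are duplicate candidates when they share a file, a start line, and the same set of CWE identifiers, regardless of which scanner reported them. The finding shape is a simplified, hypothetical stand-in for the real report format, and `dedup_key` and `group_candidates` are illustrative names, not existing code.

```python
from collections import defaultdict


def dedup_key(finding):
    """Build a scanner-independent key: file, start line, and CWE values."""
    location = finding["location"]
    cwes = frozenset(
        ident["value"]
        for ident in finding["identifiers"]
        if ident["type"].lower() == "cwe"
    )
    return (location["file"], location["start_line"], cwes)


def group_candidates(findings):
    """Group findings by key; groups with more than one entry are duplicate candidates."""
    groups = defaultdict(list)
    for finding in findings:
        groups[dedup_key(finding)].append(finding)
    return dict(groups)


# Hypothetical findings: one flaw reported by two scanners, plus an unrelated one.
findings = [
    {"scanner": "bandit", "location": {"file": "app.py", "start_line": 12},
     "identifiers": [{"type": "bandit", "value": "B303"},
                     {"type": "cwe", "value": "327"}]},
    {"scanner": "semgrep", "location": {"file": "app.py", "start_line": 12},
     "identifiers": [{"type": "semgrep", "value": "bandit.B303"},
                     {"type": "cwe", "value": "327"}]},
    {"scanner": "semgrep", "location": {"file": "app.py", "start_line": 40},
     "identifiers": [{"type": "semgrep", "value": "bandit.B608"},
                     {"type": "cwe", "value": "89"}]},
]

for key, group in group_candidates(findings).items():
    print(key[0], key[1], [f["scanner"] for f in group])
```

Note that under this assumption, findings without any CWE identifier would be grouped on location alone, which is probably too aggressive. That is one reason a shared, stable identifier matters for any approach like this.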

With the work in https://gitlab.com/groups/gitlab-org/-/epics/4690 we are exploring ways to rely on a new tracking field to separate file location from what we use to track movement of a finding, but the current scope does not include report types or identifiers. See the WIP documentation MR for more explanation of this idea.

Proposal

TBD

Architectural Support

Reminder: 72-hour SLA
Due Date: 2021-02-05
DRI: @theoretick

Scope Checklist

Does not involve architectural decisions
Is after-the-fact
Is not already covered by architecture guidelines/handbook
Has a broad impact within #secure
Is a new unit of work
Is strictly #secure
Could not come to an agreement (escalation)
Involves architectural decisions

See the scope scoring table below to interpret the checkboxes above

Scope Scoring Table

Reason	in	opt-in	out
Does not involve architectural decisions			❌
Is after-the-fact			❌
Is not already covered by architecture guidelines/handbook	❌	❌
Has a broad impact within Secure	❌
Is a new unit of work	❌	❌
Is strictly Secure	❌	❌
Could not come to an agreement (escalation)		`?`
Involves architectural decisions	❌	❌

Reviewed by

Auto-Summary 🤖

Discoto Usage

Points

Discussion points are declared by headings, list items, and single lines that start with the text (case-insensitive) point:. For example, the following are all valid points:

#### POINT: This is a point

* point: This is a point

+ Point: This is a point

- pOINT: This is a point

point: This is a **point**

Note that any markdown used in the point text will also be propagated into the topic summaries.
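
The declaration rule above can be approximated with a single regular expression. This is only a sketch of the described behavior, not Discoto's actual parser; `POINT_RE` and `extract_point` are hypothetical names.

```python
import re

# Optional heading marker (#, ##, ...) or list marker (*, +, -), then "point:".
POINT_RE = re.compile(
    r"^\s*(?:#{1,6}\s+|[*+-]\s+)?point:\s*(?P<text>.+)$",
    re.IGNORECASE,
)


def extract_point(line):
    """Return the point text if the line declares a point, else None."""
    match = POINT_RE.match(line)
    return match.group("text").strip() if match else None


for line in [
    "#### POINT: This is a point",
    "* point: This is a point",
    "point: This is a **point**",
    "A sentence that mentions point: somewhere in the middle",
]:
    print(repr(line), "->", repr(extract_point(line)))
```

The markdown in the point text is left untouched, matching the note that it is propagated into the summaries.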

Outcomes

Outcomes define the decisions or resolutions of a discussion. Once outcomes are defined, sub-topics and points are collapsed underneath the outcomes.

Outcomes are declared in a similar manner as points:

#### OUTCOME: This is an outcome

* outcome: This is an outcome

+ Outcome: This is an outcome

- oUTCOME: This is an outcome

outcome: This is an outcome

Note that multiple outcomes may be declared for each topic.

Topics

Topics can be stand-alone and contained within an issuable (epic, issue, MR), or can be inline.

Inline topics are defined by creating a new thread (discussion) where the first line of the first comment is a heading that starts with (case-insensitive) topic:. For example, the following are all valid topics:

# Topic: Inline discussion topic 1

## TOPIC: **{+A Green, bolded topic+}**

### tOpIc: Another topic
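
Unlike points, an inline topic must be a heading, and only the first line of the first comment in a thread is considered. A minimal sketch of that check, assuming the comment body is available as a plain string (`inline_topic_title` is a hypothetical helper, not part of Discoto):

```python
import re

# A markdown heading whose text starts with "topic:" (case-insensitive).
TOPIC_RE = re.compile(r"^#{1,6}\s+topic:\s*(?P<title>.+)$", re.IGNORECASE)


def inline_topic_title(first_comment):
    """Return the topic title if the comment's first line declares one, else None."""
    lines = first_comment.splitlines()
    if not lines:
        return None
    match = TOPIC_RE.match(lines[0].strip())
    return match.group("title").strip() if match else None


print(inline_topic_title("# Topic: Inline discussion topic 1\n\nSome discussion."))
print(inline_topic_title("Some discussion.\n# Topic: declared too late"))
```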

Quick Actions

Action	Description
`/discuss sub-topic TITLE`	Create an issue for a sub-topic. Does not work in epics
`/discuss link ISSUABLE-LINK`	Link an issuable as a child of this discussion

Discussion-Size Indicators

The relative size of the discussion occurring within a topic and its sub-topics is indicated via braille dots.

More dots means that more points or sub-topics exist within a given topic.

Examples:

TOPIC ⣿⣿⡆ A large discussion occurred here

TOPIC ⣇ A smaller discussion occurred here

Last updated by this job

TOPIC ⣇ Deduping vs Remapping #299589 (comment 501726039)
TOPIC ⣿ ⢀ what can be delivered within %13.9 #299589 (comment 501733971)
- IMO within %13.9 we can dedupe but we cannot remap once we drop bandit entirely #299589 (comment 501893053)
TOPIC ⣿⡀ ⢠ Standardizing on Identifiers #299589 (comment 501902338)
- Inject CWE identifiers (if missing) into findings from officially supported scanners #299589 (comment 501902338)
- CWE is always primary, unless CVE exists. CVE > CWE #299589 (comment 501902338)
stable identifiers support organizations in SLOs and audit trails #299589 (comment 506245456)
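
The identifier points above ("inject CWE if missing" and "CVE > CWE") could be sketched roughly as follows. This assumes identifiers are dictionaries with `type`, `name`, and `value` keys, and that the first identifier in the list is treated as primary. `ensure_cwe`, `order_identifiers`, and the `fallback_cwe` parameter are hypothetical; where the fallback CWE would come from (for example a per-rule mapping) is left open.

```python
# Lower number sorts first; anything that is not CVE or CWE keeps its relative order.
PRIORITY = {"cve": 0, "cwe": 1}


def ensure_cwe(identifiers, fallback_cwe=None):
    """Append a CWE identifier when none is present and a fallback is known."""
    has_cwe = any(ident["type"].lower() == "cwe" for ident in identifiers)
    if has_cwe or fallback_cwe is None:
        return list(identifiers)
    injected = {"type": "cwe", "name": f"CWE-{fallback_cwe}", "value": str(fallback_cwe)}
    return list(identifiers) + [injected]


def order_identifiers(identifiers):
    """Stable sort so that CVE comes first, then CWE, then everything else."""
    return sorted(identifiers, key=lambda ident: PRIORITY.get(ident["type"].lower(), 2))


# Hypothetical scanner output that only carries a scanner-specific identifier.
raw = [{"type": "semgrep", "name": "bandit.B303", "value": "bandit.B303"}]
normalized = order_identifiers(ensure_cwe(raw, fallback_cwe=327))
print([ident["type"] for ident in normalized])
```

With an ordering like this, the primary identifier stays the same when one scanner replaces another, as long as both report the same CWE or CVE.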

Discoto Settings

---
summary:
  max_items: -1
  sort_by: created
  sort_direction: ascending

See the settings schema for details.

Edited Mar 24, 2021 by 🤖 GitLab Bot 🤖