# Use Snowplow to collect metrics for each Secure scan
## The problem
Secure groups would like relevant information about executed scans so that they can make evidence-based product decisions.
## Proposal
Secure uses Sisense charts, with data sourced from usage metrics, to gain insights and make evidence-based decisions.
This issue outlines the limitations of that approach and proposes using Snowplow events as an additional means of sourcing metrics about Secure scans.
It does not propose replacing north-star metrics with data sourced from this approach.
## Why usage analytics isn't enough
Adding more metrics to usage analytics is of limited benefit due to the following limitations:

- Requires heavy use of the Rails database
- Metrics must be stored indefinitely (assuming global counts of executed scans need to be measured)
- Database performance/size limitations mean scans are only stored for default branches, so no information is reported for scans run on other branches
- Queries are expensive and hard to optimize
- The information sent in usage pings assumes that a successful CI job means a successful scan. It also does not capture data for retried scans, or when there is more than one scan in a job
- Only aggregated information is reported, limiting the ability to slice up information for a Sisense query/chart
- There is a 38-day cycle time (from data collection to reporting)
Usage ping still has a place, as it runs on both GitLab.com and self-managed instances.
Reference: Produce intelligence usage-ping workshop.
## Proposing: record scans using a custom Snowplow event
### Overview
This issue proposes that Rails send a custom event to the Snowplow collector each time it ingests a Secure scan report. This solves many of the issues with usage analytics:
- Data on branches can be captured as there is no requirement to have data in the database
- Zero performance impact on the Rails database
- Sending raw event data means it could be sliced/queried many ways in Sisense
- A natural place to add more properties in future
- Cycle time should be faster (to be confirmed) as sending an event is fast
- Sisense charts could be based on real scan data, not assumptions based on CI job data
There are limitations:
- Does not run on self-managed instances
- Engineers need to be careful to not send the same metric event more than once
### Possible insights
The following information could be sent with every event:

- `action`: Likely `security_scanned`, or some hardcoded value that is unique to Secure scanners
- `category`: Likely a description of the Secure analyzer, e.g. `[secure:sast:bandit]`
- `user`: The user that ran the scan
- `project`: The project that the scan ran on
- `schema_version`: The schema version used by the report
- `start_time`: The UTC start time of the report
- `end_time`: The UTC end time of the report
- `status`: Whether the scan succeeded or failed
- `vendor`: The name of the vendor, e.g. `GitLab`
- `scanner`: A unique identifier of the scanner, e.g. `zaproxy-browserker`
- `scanner_version`: A string representing the version of the scanner, e.g. `1.0.45`
Amongst others, this would allow for the following insights:
- Total number of scans run (default and branches) (GitLab.com only)
- Scan duration could be measured
- Failure rate could be measured (may require uploading a JSON report on scan failure)
- Versions of secure reports used can help determine which should be supported
- Adoption rate of new tools (e.g. Browserker) can be measured
- Understand how many clients are pinning to specific versions of analyzers
- Project/user allows the data to be sliced/grouped by each
In future, more properties could be added to this event. Most of the information for this event can likely be sourced today, as most Secure analyzers include the `scan` field in their output JSON reports.
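For illustration, the relevant portion of a Secure report might look like the following. The structure follows the Secure report schema's `scan` field; the concrete values are made up for this sketch:

```json
{
  "version": "14.0.0",
  "scan": {
    "type": "dast",
    "status": "success",
    "start_time": "2021-06-01T10:00:00",
    "end_time": "2021-06-01T10:02:30",
    "scanner": {
      "id": "zaproxy-browserker",
      "name": "OWASP Zed Attack Proxy (ZAP) with Browserker",
      "version": "2.10.0",
      "vendor": { "name": "ZAP" }
    },
    "analyzer": {
      "id": "gitlab-dast",
      "name": "GitLab DAST",
      "version": "1.50.2",
      "vendor": { "name": "GitLab" }
    }
  }
}
```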
Reference: Produce intelligence snowplow workshop.
## Implementation details
Would use `::Gitlab::Tracking.event` with a custom event schema that would need to be defined in https://gitlab.com/gitlab-org/iglu.
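As a rough sketch of what the ingestion path could do, assuming the report has already been parsed into a Hash. Only `::Gitlab::Tracking.event` is named by this proposal; the helper names, schema URI, and field extraction below are assumptions for illustration:

```ruby
require 'digest'

# Hypothetical Iglu schema reference for the custom context (would be
# registered in gitlab-org/iglu; vendor/name/version are assumptions).
SECURE_SCAN_SCHEMA = 'iglu:com.gitlab/secure_scan/jsonschema/1-0-0'

# Deterministic across Sidekiq retries: the same job and report always
# produce the same key, so queries can later filter out duplicate events.
def idempotency_key(build_id, report_path)
  Digest::SHA256.hexdigest("#{build_id}:#{report_path}")
end

# Extract the tracked fields from a parsed Secure report (a Hash);
# Hash#dig returns nil for any field the report does not provide.
def secure_scan_context(report, build_id, report_path)
  {
    idempotency_key: idempotency_key(build_id, report_path),
    scan_type: report.dig('scan', 'type'),
    status: report.dig('scan', 'status'),
    start_time: report.dig('scan', 'start_time'),
    end_time: report.dig('scan', 'end_time'),
    report_schema_version: report['version'],
    scanner: report.dig('scan', 'scanner', 'id'),
    scanner_vendor: report.dig('scan', 'scanner', 'vendor', 'name'),
    scanner_version: report.dig('scan', 'scanner', 'version'),
    analyzer: report.dig('scan', 'analyzer', 'id'),
    analyzer_vendor: report.dig('scan', 'analyzer', 'vendor', 'name'),
    analyzer_version: report.dig('scan', 'analyzer', 'version')
  }
end

# Inside the ingestion service, the event could then be sent with
# something along the lines of:
#
#   context = SnowplowTracker::SelfDescribingJson.new(
#     SECURE_SCAN_SCHEMA, secure_scan_context(report, build.id, path)
#   )
#   ::Gitlab::Tracking.event('secure:scan', 'scan', context: [context])
```

Deriving the idempotency key from stable identifiers (rather than a random value) is what makes retried Sidekiq jobs produce identical keys, so duplicates can be collapsed at query time.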
## Questions
- What is the cycle time for Snowplow events?
  - "Real time" (see comment in video)
- What happens to the data after the event is sent to Snowplow?
  - Documented at snowplow-request-flow
- How can we query for event data in Sisense? This may be a blocker.
  - Asked in Slack
  - Could potentially have a query such as `WHERE contexts LIKE '%UNIQUE_PROPERTY_NAME%'`
  - The issue #299354 (closed) will convert all context properties to top-level fields, which will make this much easier to query
- How do we build a custom JSON `iglu` schema for use as a custom event?
- Should we use a self-describing event or a custom context?
  - Survey Responses is a good example; it looks like a custom context will be enough
- Need to understand if `category` can be broken down: for example, can we query `sast` and `bandit` separately for the category `[secure:sast:bandit]`?
  - Just use a `LIKE` query, though we'll likely have the scan type information in `property`
- Is this approach resilient to Sidekiq jobs being retried, or is there some kind of deduplication happening in Snowplow? See comment
  - No, it's not resilient to retries/duplicate jobs. Snowplow does not deduplicate events. This will have to be handled with an extra key, `idempotency_key`, and each query will have to filter duplicates out.
  - Example SQL to filter out duplicate rows:

    ```sql
    WITH summary AS (
      SELECT t.name,
             t.version,
             ROW_NUMBER() OVER (PARTITION BY t.idempotency_key) AS rank
      FROM analytics_table t
    )
    SELECT *
    FROM summary
    WHERE rank = 1;
    ```

- How can we add scan-type-specific (SAST, DAST, etc.) metrics to this tracked event?
  - Add another field with the scan type
- Are there any privacy concerns using namespace/project ID?
  - Yes. The product intelligence team is waiting on a privacy policy update (see comment in video)
  - For now, we pass through namespace/project ID and the collector in Rails will discard it.
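Once the events land in a queryable table, the idempotency-key filtering described in the questions above can be combined with slicing for a Sisense chart. The table and column names below are hypothetical; this sketches counting successful scans per scan type per day:

```sql
-- Hypothetical table/column names: count successful scans per scan type
-- per day, keeping a single row per idempotency_key to drop retried jobs.
WITH deduped AS (
  SELECT *,
         ROW_NUMBER() OVER (PARTITION BY idempotency_key) AS rank
  FROM secure_scan_events
)
SELECT scan_type,
       DATE_TRUNC('day', start_time) AS day,
       COUNT(*) AS scans
FROM deduped
WHERE rank = 1
  AND status = 'success'
GROUP BY scan_type, day
ORDER BY day;
```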
## Additional fields to parse
- `analyzer.id`
- `analyzer.name`
- `analyzer.version`
- `analyzer.vendor.name`
- `version` (schema version in report)
- `scanner.version`
## Fields to track
- `category`: `secure:scan`
- `action`: `scan`
- `triggered_by`: The ID of the user that ran the scan
- `project`: The ID of the project that the scan ran on
- `report_schema_version`: The schema version used by the JSON report
- `start_time`: The UTC start time of the report (if present)
- `end_time`: The UTC end time of the report (if present)
- `status`: Whether the scan succeeded or failed (if present)
- `scan_type`: `sast`, `dast`, `api-fuzzing`, etc.
- `scanner`: A unique identifier of the scanner, e.g. `zaproxy-browserker` (if present)
- `scanner_vendor`: The name of the vendor of the scanner, e.g. `ZAP` (if present)
- `scanner_version`: A string representing the version of the scanner, e.g. `2.10.0` (if present)
- `analyzer`: A unique identifier of the analyzer, e.g. `gitlab-dast` (if present)
- `analyzer_vendor`: The name of the vendor of the analyzer, e.g. `GitLab` (if present)
- `analyzer_version`: A string representing the version of the analyzer, e.g. `1.50.2` (if present)
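If a custom context is used, the schema registered in gitlab-org/iglu could look roughly like this. The `self` wrapper is the standard self-describing JSON Schema format; the vendor/name/version coordinates and the exact property list are assumptions based on the fields above:

```json
{
  "$schema": "http://iglucentral.com/schemas/com.snowplowanalytics.self-desc/schema/jsonschema/1-0-0#",
  "description": "Context describing a Secure scan ingested by Rails",
  "self": {
    "vendor": "com.gitlab",
    "name": "secure_scan",
    "format": "jsonschema",
    "version": "1-0-0"
  },
  "type": "object",
  "properties": {
    "idempotency_key": { "type": "string" },
    "scan_type": { "type": "string" },
    "status": { "type": ["string", "null"] },
    "start_time": { "type": ["string", "null"] },
    "end_time": { "type": ["string", "null"] },
    "report_schema_version": { "type": ["string", "null"] },
    "scanner": { "type": ["string", "null"] },
    "scanner_vendor": { "type": ["string", "null"] },
    "scanner_version": { "type": ["string", "null"] },
    "analyzer": { "type": ["string", "null"] },
    "analyzer_vendor": { "type": ["string", "null"] },
    "analyzer_version": { "type": ["string", "null"] }
  },
  "additionalProperties": false,
  "required": ["idempotency_key", "scan_type"]
}
```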
## Architectural Support
- Reminder: 72-hour SLA (note: not fussed about this requirement)
- Due Date: n/a
- DRI: @cam_swords
## Scope Checklist
- [ ] Does not involve architectural decisions
- [ ] Is after-the-fact
- [ ] Is not already covered by architecture guidelines/handbook
- [ ] Has a broad impact within #secure
- [ ] Is a new unit of work
- [ ] Is strictly #secure
- [ ] Could not come to an agreement (escalation)
- [ ] Involves architectural decisions

See the scope scoring table below to interpret the checkboxes above.
Scope Scoring Table
| Reason | in | opt-in | out | |------------------------------------------------------------|:---:|:------:|:---:|
| Does not involve architectural decisions | | | ?
| |
| Involves architectural decisions |