Use Snowplow to collect metrics for each Secure scan

The problem

Secure groups would like relevant information about executed scans so that they can make evidence-based product decisions.

Proposal

Secure currently uses Sisense charts, with data sourced from usage metrics, to gain insights and make evidence-based decisions.

This issue outlines the limitations of that approach and proposes using Snowplow events as an additional source of metrics about Secure scans.

This proposal does not suggest replacing north-star metrics with data sourced from Snowplow events.

Why usage analytics isn't enough

Adding more metrics to usage analytics is of limited benefit due to the following limitations:

Requires heavy use of the Rails database
- Metrics must be stored indefinitely (assuming global counts of executed scans need to be measured)
- Database performance and size limitations mean scans are only stored for default branches, so no information is reported for scans run on other branches
- Queries are expensive and hard to optimize
The information sent in usage pings assumes that a successful CI job means a successful scan. It also does not capture retried scans, or cases where a job runs more than one scan
Only reports aggregated information, limiting the ability to slice the data for a Sisense query/chart
There is a 38-day cycle time (time from data collected to data reported)

Usage ping does have a place, as it runs on both GitLab.com and on self-managed instances.

Reference: Product Intelligence usage-ping workshop.

Proposing: record scans using a custom Snowplow event

Overview

This issue proposes that Rails sends a custom event to the Snowplow collector each time it ingests a Secure scan report. This solves many of the issues with usage analytics:

Data on branches can be captured as there is no requirement to have data in the database
Zero performance impact on the Rails database
Sending raw event data means it could be sliced/queried many ways in Sisense
A natural place to add more properties in future
Cycle time should be faster (to be confirmed) as sending an event is fast
Sisense charts could be based on real scan data, not assumptions based on CI job data

There are limitations:

Does not run on self-managed instances
Engineers need to be careful not to send the same metric event more than once

Possible insights

The following information could be sent with every event:

action: Likely security_scanned, or some other hardcoded value that is unique to Secure scanners
category: Likely a description of the Secure analyzer, e.g. [secure:sast:bandit]
user: The user that ran the scan
project: The project that the scan ran on
schema_version: The schema version used by the report
start_time: The UTC start time of the report
end_time: The UTC end time of the report
status: Whether the scan succeeded or failed
vendor: The name of the vendor, e.g. GitLab
scanner: A unique identifier of the scanner, e.g. zaproxy-browserker
scanner_version: A string representing the version of the scanner, e.g. 1.0.45
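As a rough illustration of how these fields might be gathered, the sketch below maps a parsed report onto the proposed properties. It assumes the report carries a top-level version plus a scan object with type, status, start_time, end_time and scanner keys. The helper name scan_event_attributes, the category format, and the user_id/project_id arguments are hypothetical and not taken from existing code.

```ruby
require 'json'

# Hypothetical helper: maps a parsed Secure report onto the proposed event fields.
# The report layout used here (top-level "version" plus a "scan" object) is an
# assumption; missing keys simply become nil.
def scan_event_attributes(report, user_id:, project_id:)
  scan = report.fetch('scan', {})
  {
    action: 'security_scanned',
    # Category format guessed from the example secure:sast:bandit.
    category: "secure:#{scan['type']}:#{scan.dig('scanner', 'id')}",
    user: user_id,
    project: project_id,
    schema_version: report['version'],
    start_time: scan['start_time'],
    end_time: scan['end_time'],
    status: scan['status'],
    vendor: scan.dig('scanner', 'vendor', 'name'),
    scanner: scan.dig('scanner', 'id'),
    scanner_version: scan.dig('scanner', 'version')
  }
end

# Usage with a made-up report body.
sample_report = JSON.parse(<<~REPORT)
  {
    "version": "14.0.0",
    "scan": {
      "type": "sast",
      "status": "success",
      "start_time": "2021-06-01T10:00:00",
      "end_time": "2021-06-01T10:02:30",
      "scanner": { "id": "bandit", "version": "1.0.45", "vendor": { "name": "GitLab" } }
    }
  }
REPORT

puts JSON.pretty_generate(scan_event_attributes(sample_report, user_id: 1, project_id: 2))
```

Keeping the mapping in one place would make it easier to add properties later without touching the code that sends the event.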

Amongst others, this would allow for the following insights:

Total number of scans run (default and branches) (GitLab.com only)
Scan duration could be measured
Failure rate could be measured (may require uploading a JSON report on scan failure)
Versions of secure reports used can help determine which should be supported
Adoption rate of new tools (e.g. Browserker) can be measured
Understand how many clients are pinning to specific versions of analyzers
Project/user allows the data to be sliced/grouped by each

In future, more properties could be added to this event. The information for this event can likely be sourced already, as most Secure analyzers include the scan field in the JSON reports they output.
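To make the insights listed above concrete, here is a minimal sketch of how scan duration, failure rate, and per-version scan counts could be derived from raw events. It assumes events are hashes keyed by the proposed field names, that timestamps are ISO 8601 strings with a Z suffix, and that status is either 'success' or something else. The function names are illustrative only; in practice these would more likely be Sisense queries.

```ruby
require 'time'

# Duration of one scan in seconds, from its start_time and end_time.
def scan_duration_seconds(event)
  Time.iso8601(event[:end_time]) - Time.iso8601(event[:start_time])
end

# Fraction of events whose status is not 'success' (0.0 for an empty list).
def failure_rate(events)
  return 0.0 if events.empty?

  events.count { |event| event[:status] != 'success' } / events.size.to_f
end

# Number of scans per [scanner, scanner_version] pair, e.g. to spot version pinning.
def scans_per_scanner_version(events)
  events
    .group_by { |event| [event[:scanner], event[:scanner_version]] }
    .transform_values(&:size)
end
```

Because each event carries its own properties, the same raw data can be regrouped by project, user, or schema version without changing what is collected.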

Reference: Product Intelligence Snowplow workshop.

Implementation details

The implementation would use ::Gitlab::Tracking.event with a custom event schema, which would need to be defined in https://gitlab.com/gitlab-org/iglu.
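A minimal sketch of what this could look like follows. It wraps the scan properties in a self-describing JSON envelope (a schema URI plus a data object) and hands it to a tracker. The schema URI, the helper names, and the tracker callable are all hypothetical: the callable only stands in for ::Gitlab::Tracking.event so the example runs outside Rails, and the real argument list should be checked against the Rails code.

```ruby
# Hypothetical schema URI; the real one would be defined in gitlab-org/iglu.
SECURE_SCAN_SCHEMA = 'iglu:com.gitlab/secure_scan/jsonschema/1-0-0'

# Wraps scan properties in a self-describing JSON envelope.
def secure_scan_context(data)
  { schema: SECURE_SCAN_SCHEMA, data: data }
end

# Sends one event through `tracker`, a callable standing in for
# ::Gitlab::Tracking.event (assumed shape: category, action, context:).
def track_secure_scan(attributes, tracker:)
  data = attributes.reject { |key, _| %i[category action].include?(key) }
  tracker.call(
    attributes[:category],
    attributes[:action],
    context: [secure_scan_context(data)]
  )
end

# Usage with a stand-in tracker that just prints what it receives.
printer = ->(category, action, context:) { puts [category, action, context].inspect }
track_secure_scan(
  { category: 'secure:sast:bandit', action: 'security_scanned', status: 'success' },
  tracker: printer
)
```

Passing the tracker in as an argument is only a convenience for this sketch; it also happens to make the call easy to exercise in a unit test.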

Questions

What is the cycle time for Snowplow events?
- "Real time" (see comment in video)
What happens to the data after the event is sent to Snowplow?
- Documented at snowplow-request-flow
How can we query for event data in Sisense?
- This may be a blocker
- Asked in Slack
- Could potentially have a query such as WHERE contexts LIKE '%UNIQUE_PROPERTY_NAME%'
- The issue #299354 (closed) will convert all context properties to top level fields, which will make this much easier to query
How do we build a custom JSON iglu schema for use as a custom event?
- Example schema
- See Adding new schema
Should we use a self-describing event or a custom context?
- Survey Responses is a good example. It looks like a custom context will be enough
Can category be broken down by :? For example, can we query sast and bandit separately for the category [secure:sast:bandit]?
- Just use a LIKE query, though we'll likely have the scan type information in property.
Is this approach resilient to Sidekiq jobs being retried, or is there some kind of deduplication happening in Snowplow? See comment
- No, it's not resilient to retries/duplicate jobs. Snowplow does not deduplicate events. This will have to be handled with an extra idempotency_key field, and each query will have to filter out the duplicates.
- Example SQL to filter out duplicate rows:
```
-- Number the rows within each idempotency_key, then keep only the first one.
WITH summary AS (
    SELECT t.name,
           t.version,
           ROW_NUMBER() OVER (PARTITION BY t.idempotency_key) AS row_num
    FROM   analytics_table t
)
SELECT *
FROM   summary
WHERE  row_num = 1;
```
How can we add scan type (SAST, DAST etc) specific metrics to this tracked event?
- Add another field with scan type
Are there any privacy concerns using namespace/project ID?
- Yes. The product intelligence team is waiting on a privacy policy update (see comment in video)
- For now, we pass through namespace/project ID and the collector in Rails will discard it.
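The idempotency_key answer above can be sketched as follows. The source does not say how the key would be derived, so the inputs used here (project ID, CI job ID, report type) are purely an assumption, as are the function names. The point is only that a retried job must produce the same key so duplicates can be filtered later.

```ruby
require 'digest'

# Hypothetical derivation: a stable digest of values that identify one
# ingested report. A retry of the same job yields the same key.
def idempotency_key(project_id:, job_id:, report_type:)
  Digest::SHA256.hexdigest([project_id, job_id, report_type].join(':'))
end

# Keeps the first event seen for each idempotency_key, mirroring the
# "row number = 1" filter in the SQL example.
def deduplicate(events)
  events.uniq { |event| event[:idempotency_key] }
end

# Usage: the second event simulates a Sidekiq retry of the same job.
key = idempotency_key(project_id: 1, job_id: 10, report_type: 'sast')
events = [
  { idempotency_key: key, status: 'success' },
  { idempotency_key: key, status: 'success' }
]
puts deduplicate(events).size
```

Whatever inputs are chosen, they need to be available at ingestion time and identical across retries; a random value generated per attempt would defeat the purpose.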

Additional fields to parse

Fields to track

Architectural Support

Reminder: 72-hour SLA (note: not fussed about this requirement)
Due Date: n/a
DRI: @cam_swords

Scope Checklist

Does not involve architectural decisions
Is after-the-fact
Is not already covered by architecture guidelines/handbook
Has a broad impact within #secure
Is a new unit of work
Is strictly #secure
Could not come to an agreement (escalation)
Involves architectural decisions

See the scope scoring table below to interpret the checkboxes above.

Scope Scoring Table

| Reason                                                     | in  | opt-in | out |
|------------------------------------------------------------|:---:|:------:|:---:|
| Does not involve architectural decisions                   |     |        | ❌  |
| Is after-the-fact                                          |     |        | ❌  |
| Is not already covered by architecture guidelines/handbook | ❌  | ❌     |     |
| Has a broad impact within Secure                           | ❌  |        |     |
| Is a new unit of work                                      | ❌  | ❌     |     |
| Is strictly Secure                                         | ❌  | ❌     |     |
| Could not come to an agreement (escalation)                |     | ?      |     |
| Involves architectural decisions                           | ❌  | ❌     |     |

Reviewed by

Edited Jul 02, 2021 by Cameron Swords