Use Snowplow to collect metrics for each Secure scan

The problem

Secure groups would like relevant information about executed scans so that they can make evidence-based product decisions.

Proposal

Secure currently uses Sisense charts, with data sourced from usage metrics, to gain insights and make evidence-based decisions.

This issue outlines the limitations of this approach and proposes to use Snowplow events as an additional means to source metrics about Secure scans.

This proposal does not replace north-star metrics with data sourced from this approach.

Why usage analytics isn't enough

Adding more metrics to usage analytics is of limited benefit due to the following limitations:

  • Requires heavy use of the Rails database
    • Metrics must be stored indefinitely (assuming global counts of executed scans need to be measured)
    • Database performance and size limitations mean scans are only stored for default branches, so nothing is reported for scans run on other branches
    • Queries are expensive and hard to optimize
  • The information sent in usage pings assumes that a successful CI job means a successful scan. It also does not capture data for retried scans, or for jobs that run more than one scan
  • Only aggregated information is reported, which limits the ability to slice the data in a Sisense query or chart
  • There is a 38-day cycle time (the time from data being collected to it being reported)

Usage ping does have a place, as it runs on both GitLab.com and on self-managed instances.

Reference: Product Intelligence usage-ping workshop.

Proposing: record scans using a custom Snowplow event

Overview

This issue proposes that Rails sends a custom event to the Snowplow collector each time it ingests a Secure scan report. This solves many of the issues with usage analytics:

  • Data on branches can be captured as there is no requirement to have data in the database
  • Zero performance impact on the Rails database
  • Sending raw event data means it could be sliced/queried many ways in Sisense
  • A natural place to add more properties in future
  • Cycle time should be faster (to be confirmed) as sending an event is fast
  • Sisense charts could be based on real scan data, not assumptions based on CI job data

There are limitations:

  • Does not run on self-managed instances
  • Engineers need to be careful not to send the same metric event more than once

Possible insights

The following information could be sent with every event:

  • action Likely security_scanned, or some hardcoded value that is unique to Secure scanners
  • category Likely a description of the Secure analyzer, e.g. [secure:sast:bandit]
  • user The user that ran the scan
  • project The project that the scan ran on
  • schema_version The schema version used by the report
  • start_time The UTC start time of the report
  • end_time The UTC end time of the report
  • status Whether the scan succeeded or failed
  • vendor The name of the vendor e.g. GitLab
  • scanner A unique identifier of the scanner e.g. zaproxy-browserker
  • scanner_version A string representing the version of the scanner e.g. 1.0.45

Amongst others, this would allow for the following insights:

  • Total number of scans run (default and branches) (GitLab.com only)
  • Scan duration could be measured
  • Failure rate could be measured (may require uploading a JSON report on scan failure)
  • The schema versions of Secure reports in use can help determine which versions should be supported
  • Adoption rate of new tools (e.g. Browserker) can be measured
  • Understand how many clients are pinning to specific versions of analyzers
  • The project and user properties allow the data to be sliced or grouped by either

In future, more properties could be added to this event. The information for this event can likely be sourced today, as most Secure analyzers include the scan field in the JSON reports they output.
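
As an illustration of the scan duration and failure rate insights listed above, here is a minimal Ruby sketch that aggregates raw events per scanner. It is not part of the proposal: the event hashes, the ISO 8601 timestamps with a trailing Z, and the status values success and failure are all assumptions made for the example, and the real aggregation would most likely be a Sisense query.

```ruby
require 'time'

# Hypothetical raw events using the property names proposed above.
# The timestamp format and status values are assumptions.
EVENTS = [
  { scanner: 'zaproxy-browserker', status: 'success',
    start_time: '2021-01-01T10:00:00Z', end_time: '2021-01-01T10:05:00Z' },
  { scanner: 'zaproxy-browserker', status: 'failure',
    start_time: '2021-01-01T11:00:00Z', end_time: '2021-01-01T11:01:00Z' },
  { scanner: 'bandit', status: 'success',
    start_time: '2021-01-01T12:00:00Z', end_time: '2021-01-01T12:00:30Z' }
].freeze

# Duration in seconds, or nil when either timestamp is missing.
def scan_duration(event)
  return nil unless event[:start_time] && event[:end_time]

  Time.iso8601(event[:end_time]) - Time.iso8601(event[:start_time])
end

# Fraction of events whose status is anything other than 'success'.
def failure_rate(events)
  return 0.0 if events.empty?

  events.count { |e| e[:status] != 'success' }.fdiv(events.size)
end

# Per-scanner scan count, failure rate, and mean duration in seconds.
def summarize(events)
  events.group_by { |e| e[:scanner] }.transform_values do |group|
    durations = group.map { |e| scan_duration(e) }.compact
    {
      scans: group.size,
      failure_rate: failure_rate(group),
      mean_duration: durations.empty? ? nil : durations.sum / durations.size
    }
  end
end
```

Because each event carries its own timestamps and status, the same raw rows can be regrouped by project, user, or scanner version without changing what is collected.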

Reference: Product Intelligence Snowplow workshop.

Implementation details

The implementation would use ::Gitlab::Tracking.event with a custom event schema, which would need to be defined in https://gitlab.com/gitlab-org/iglu.

Questions

  • What is the cycle time for Snowplow events?
    • "Real time" (see comment in video)
  • What happens to the data after the event is sent to Snowplow?
  • How can we query for event data in Sisense?
    • This may be a blocker
    • Asked in Slack
    • Could potentially have a query such as WHERE contexts LIKE '%UNIQUE_PROPERTY_NAME%'
    • The issue #299354 (closed) will convert all context properties to top level fields, which will make this much easier to query
  • How do we build a custom JSON iglu schema for use as a custom event?
  • Should we use a self-describing event or a custom context?
    • Survey Responses is a good example; it looks like a custom context will be enough
  • Can category be broken down by :? For example, can we query sast and bandit separately for the category [secure:sast:bandit]?
    • A LIKE query would work, though the scan type information will likely be available in property anyway.
  • Is this approach resilient to Sidekiq jobs being retried, or is there some kind of deduplication happening in Snowplow? See comment
    • No, it's not resilient to retries or duplicate jobs, and Snowplow does not deduplicate events. This will have to be handled with an extra idempotency_key property, and each query will have to filter out the duplicates.
    • Example SQL to filter out duplicate rows:
      WITH summary
           AS (SELECT t.name,
                      t.version,
                      -- Number the rows that share a key; any one of them can be kept.
                      ROW_NUMBER() OVER (PARTITION BY t.idempotency_key
                                         ORDER BY t.idempotency_key) AS row_num
               FROM   analytics_table t)
      SELECT name,
             version
      FROM   summary
      WHERE  row_num = 1;
  • How can we add scan type (SAST, DAST etc) specific metrics to this tracked event?
    • Add another field with scan type
  • Are there any privacy concerns using namespace/project ID?
    • Yes. The product intelligence team is waiting on a privacy policy update (see comment in video)
    • For now, we pass through namespace/project ID and the collector in Rails will discard it.
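
The idempotency_key answer to the retry question above could be sketched as follows. This is a hedged illustration only: deriving the key from a project ID, a job ID, and a scan type is an assumption (the issue does not say what the key is built from), and both method names are hypothetical. The point is that the key must be computed from values that stay the same when a Sidekiq job is retried, so that duplicates collapse onto one key.

```ruby
require 'digest'

# Hypothetical: build the key from values that are stable across retries.
# The choice of inputs (project_id, job_id, scan_type) is an assumption.
def idempotency_key(project_id:, job_id:, scan_type:)
  Digest::SHA256.hexdigest([project_id, job_id, scan_type].join(':'))
end

# Keep the first event seen for each key, mirroring the SQL filter above.
def deduplicate(events)
  events.uniq { |event| event[:idempotency_key] }
end

first_key  = idempotency_key(project_id: 1, job_id: 100, scan_type: 'sast')
retry_key  = idempotency_key(project_id: 1, job_id: 100, scan_type: 'sast')
second_key = idempotency_key(project_id: 1, job_id: 101, scan_type: 'dast')

# The second row simulates the same event being sent again by a retried job.
SENT_EVENTS = [
  { idempotency_key: first_key,  scan_type: 'sast' },
  { idempotency_key: retry_key,  scan_type: 'sast' },
  { idempotency_key: second_key, scan_type: 'dast' }
].freeze

UNIQUE_EVENTS = deduplicate(SENT_EVENTS)
```

A random key generated per send would not work here, because each retry would produce a new key and the duplicates would survive the filter.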

Additional fields to parse

  • analyzer.id
  • analyzer.name
  • analyzer.version
  • analyzer.vendor.name
  • version (schema version in report)
  • scanner.version

Fields to track

  • category secure:scan
  • action scan
  • triggered_by The id of the user that ran the scan
  • project The id of the project that the scan ran on
  • report_schema_version The schema version used by the JSON report
  • start_time The UTC start time of the report (if present)
  • end_time The UTC end time of the report (if present)
  • status Whether the scan succeeded or failed (if present)
  • scan_type sast, dast, api-fuzzing, etc.
  • scanner A unique identifier of the scanner e.g. zaproxy-browserker (if present)
  • scanner_vendor The name of the vendor of the scanner e.g. ZAP (if present)
  • scanner_version A string representing the version of the scanner e.g. 2.10.0 (if present)
  • analyzer A unique identifier of the analyzer e.g. gitlab-dast (if present)
  • analyzer_vendor The name of the vendor of the analyzer e.g. GitLab (if present)
  • analyzer_version A string representing the version of the analyzer e.g. 1.50.2 (if present)
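
A minimal sketch of how the fields above might be read out of a JSON report, assuming the report nests the analyzer and scanner details under a top-level scan key as the "Additional fields to parse" list suggests. The exact JSON layout, the sample values, and the method name scan_event_properties are assumptions for illustration. Passing the result to ::Gitlab::Tracking.event is deliberately left out, since its signature is not described in this issue.

```ruby
require 'json'

# Sample report; the layout and values are assumptions for this sketch.
SAMPLE_REPORT = <<~JSON
  {
    "version": "13.1.0",
    "scan": {
      "type": "dast",
      "status": "success",
      "start_time": "2021-01-01T10:00:00",
      "end_time": "2021-01-01T10:05:00",
      "scanner": { "id": "zaproxy-browserker", "version": "2.10.0", "vendor": { "name": "ZAP" } },
      "analyzer": { "id": "gitlab-dast", "version": "1.50.2", "vendor": { "name": "GitLab" } }
    }
  }
JSON

# Builds the properties for one tracked scan event.
# Fields marked "(if present)" are dropped when the report omits them.
def scan_event_properties(report_json, project_id:, user_id:)
  report = JSON.parse(report_json)
  scan = report['scan'] || {}

  {
    category: 'secure:scan',
    action: 'scan',
    triggered_by: user_id,
    project: project_id,
    report_schema_version: report['version'],
    start_time: scan['start_time'],
    end_time: scan['end_time'],
    status: scan['status'],
    scan_type: scan['type'],
    scanner: scan.dig('scanner', 'id'),
    scanner_vendor: scan.dig('scanner', 'vendor', 'name'),
    scanner_version: scan.dig('scanner', 'version'),
    analyzer: scan.dig('analyzer', 'id'),
    analyzer_vendor: scan.dig('analyzer', 'vendor', 'name'),
    analyzer_version: scan.dig('analyzer', 'version')
  }.compact # remove keys whose value is nil
end
```

For example, scan_event_properties(SAMPLE_REPORT, project_id: 1, user_id: 2) returns a hash with all fifteen fields, while a report that has no scan section yields only category, action, triggered_by, project, and report_schema_version.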

Architectural Support

  • Reminder: 72-hour SLA (note: not fussed about this requirement)
  • Due Date: n/a
  • DRI: @cam_swords

Scope Checklist

  • Does not involve architectural decisions
  • Is after-the-fact
  • Is not already covered by architecture guidelines/handbook
  • Has a broad impact within #secure
  • Is a new unit of work
  • Is strictly #secure
  • Could not come to an agreement (escalation)
  • Involves architectural decisions

See the scope scoring table below to interpret the checkboxes above.

Scope Scoring Table

| Reason                                                     | in  | opt-in | out |
|------------------------------------------------------------|:---:|:------:|:---:|
| Does not involve architectural decisions                   |     |        |     |
| Is after-the-fact                                          |     |        |     |
| Is not already covered by architecture guidelines/handbook |     |        |     |
| Has a broad impact within Secure                           |     |        |     |
| Is a new unit of work                                      |     |        |     |
| Is strictly Secure                                         |     |        |     |
| Could not come to an agreement (escalation)                |     |   ?    |     |
| Involves architectural decisions                           |     |        |     |

Reviewed by

Edited by Cameron Swords