Skip to content

Fix Security features tables cross-joining `ci_builds` -> Remove `[analyzer]_scans` metrics join to ci_builds

From !62092 (comment 626027123) we see that Security::Scan.latest_successful_by_build requires joining from non ci_* tables to ci_* tables. There is also a similar join happening in UsageData#count_secure_scans. When we extract ci_* tables to a new database this will not be possible.

As such we need to find an alternative approach.

UsageData metrics

The specific metrics are:

config/metrics/counts_28d/20210216183832_dast_scans.yml
3:key_path: usage_activity_by_stage_monthly.secure.dast_scans

config/metrics/counts_28d/20210216183836_coverage_fuzzing_scans.yml
3:key_path: usage_activity_by_stage_monthly.secure.coverage_fuzzing_scans

config/metrics/counts_28d/20210216183830_container_scanning_scans.yml
3:key_path: usage_activity_by_stage_monthly.secure.container_scanning_scans

config/metrics/counts_28d/20210216183826_sast_scans.yml
3:key_path: usage_activity_by_stage_monthly.secure.sast_scans

config/metrics/counts_28d/20210216183834_secret_detection_scans.yml
3:key_path: usage_activity_by_stage_monthly.secure.secret_detection_scans

config/metrics/counts_28d/20210216183838_api_fuzzing_scans.yml
3:key_path: usage_activity_by_stage_monthly.secure.api_fuzzing_scans

Options

  1. Is it possible to remove or somehow refactor this code so that this isn't necessary?
  2. Is it possible to de-normalize the ci_builds.status column onto the security_scans model so that no join is necessary? This may be an issue if the status changes after we create security_scans record as we'll also have to keep this up to date.
  3. Is there a performant way for us to do this in 2 separate queries and join in memory?
  4. Is there a new table we could create to store the relevant data we need to construct these queries? It will be de-normalized and storing redundant info that is on the other tables, can be written to only after the data needs to be stored (eg. after a build finishes) and can be written in sidekiq and then it will allow much faster querying for the specific data we need for the feature.

Sharding team recommendation

It appears to me these 2 joins are only used for UsageData and it's only used to avoid double counting for retried CI builds. It appears we should just replace this with the double-counted version which isn't necessarily a worse metric anyway.

Also I think using count distinct by build_id could also solve the double counting problem except it will count in-progress or failed jobs.

In any case the recommended approach will be simplify the metrics even if it means losing some level of detail and approach the metric again considering the limitations of the 2 separate databases.

Proposal

These metrics should report for all scans independent of build status or build retried status. This will remove the required join on the ci_builds table.

Edited by Cameron Swords