Fix Security features tables cross-joining `ci_builds` -> Remove `[analyzer]_scans` metrics join to ci_builds
From !62092 (comment 626027123) we see that Security::Scan.latest_successful_by_build requires joining from non ci_*
tables to ci_*
tables. There is also a similar join happening in UsageData#count_secure_scans
. When we extract ci_*
tables to a new database this will not be possible.
As such we need to find an alternative approach.
UsageData metrics
The specific metrics are:
config/metrics/counts_28d/20210216183832_dast_scans.yml
3:key_path: usage_activity_by_stage_monthly.secure.dast_scans
config/metrics/counts_28d/20210216183836_coverage_fuzzing_scans.yml
3:key_path: usage_activity_by_stage_monthly.secure.coverage_fuzzing_scans
config/metrics/counts_28d/20210216183830_container_scanning_scans.yml
3:key_path: usage_activity_by_stage_monthly.secure.container_scanning_scans
config/metrics/counts_28d/20210216183826_sast_scans.yml
3:key_path: usage_activity_by_stage_monthly.secure.sast_scans
config/metrics/counts_28d/20210216183834_secret_detection_scans.yml
3:key_path: usage_activity_by_stage_monthly.secure.secret_detection_scans
config/metrics/counts_28d/20210216183838_api_fuzzing_scans.yml
3:key_path: usage_activity_by_stage_monthly.secure.api_fuzzing_scans
Options
- Is it possible to remove or somehow refactor this code so that this isn't necessary?
- Is it possible to de-normalize the
ci_builds.status
column onto thesecurity_scans
model so that no join is necessary? This may be an issue if the status changes after we createsecurity_scans
record as we'll also have to keep this up to date. - Is there a performant way for us to do this in 2 separate queries and join in memory?
- Is there a new table we could create to store the relevant data we need to construct these queries? It will be de-normalized and storing redundant info that is on the other tables, can be written to only after the data needs to be stored (eg. after a build finishes) and can be written in sidekiq and then it will allow much faster querying for the specific data we need for the feature.
Sharding team recommendation
It appears to me these 2 joins are only used for UsageData
and it's only used to avoid double counting for retried CI builds. It appears we should just replace this with the double-counted version which isn't necessarily a worse metric anyway.
Also I think using count distinct by build_id
could also solve the double counting problem except it will count in-progress or failed jobs.
In any case the recommended approach will be simplify the metrics even if it means losing some level of detail and approach the metric again considering the limitations of the 2 separate databases.
Proposal
These metrics should report for all scans independent of build status or build retried status. This will remove the required join on the ci_builds
table.