Bug: Namespace vulnerability statistics schedule worker missing ids
Background:
Current Logic
The schedule worker uses somewhat odd logic to efficiently batch namespaces with the following considerations:
- The namespaces table holds records for both projects and groups
- ~10% of the records in the table are groups
- ~1.5% of the groups have vulnerabilities
- To avoid inefficient batching over all group ids, we want to check whether a group has vulnerabilities before sending it to the calculation service
High-level logic:
- Iterate over namespaces in batches
- For each batch, select the `id` and `traversal_ids`
- Select all the `traversal_ids` from `vulnerability_statistics` that appeared in this batch
- Filter the `id` values based on whether their matching `traversal_ids` was found in `vulnerability_statistics`
- Aggregate filtered `id` values and schedule batch processing
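The batching flow above can be sketched roughly as follows. This is a minimal in-memory illustration, not the worker's actual code: the names `each_batch` and `ids_to_schedule`, the tuple representation of rows, and the batch size are all assumptions made for the example.

```python
from typing import Iterator

# Hypothetical stand-in for a namespaces row: (id, traversal_ids).
Namespace = tuple[int, tuple[int, ...]]


def each_batch(rows: list[Namespace], batch_size: int) -> Iterator[list[Namespace]]:
    """Yield consecutive slices of rows, mimicking batched iteration."""
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]


def ids_to_schedule(
    namespaces: list[Namespace],
    statistics_traversal_ids: set[tuple[int, ...]],
    batch_size: int = 2,
) -> list[int]:
    """Collect ids whose traversal_ids appear verbatim in the statistics set."""
    scheduled: list[int] = []
    for batch in each_batch(namespaces, batch_size):
        # Traversal ids seen in this batch.
        batch_traversal_ids = {traversal_ids for _, traversal_ids in batch}
        # Keep only those that also exist in vulnerability_statistics.
        found = batch_traversal_ids & statistics_traversal_ids
        # Filter the batch ids by exact traversal_ids match.
        scheduled.extend(ns_id for ns_id, traversal_ids in batch if traversal_ids in found)
    return scheduled
```

In the real worker the set intersection would presumably be a database query scoped to the batch; the sketch only shows the order of the steps.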
The Problem
To avoid recalculating statistics over and over for the same group, we only take a group id if its matching traversal_id explicitly appears in vulnerability_statistics. This causes us to miss groups whose traversal_ids are a prefix of the traversal_id of an existing record.
Example:
vulnerability_statistics table:
| id | project_id | total | critical | traversal_id |
|---|---|---|---|---|
| 1 | 30 | 5 | 5 | {20,30,40} |
namespaces table:
| id | name | traversal_id |
|---|---|---|
| 20 | group_20 | {20} |
| 30 | group_30 | {20,30} |
| 40 | group_40 | {20,30,40} |
In this case, we will only pass id = 40 to the batch processing, even though all of the groups should have their statistics calculated.
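The example can be reproduced with a small sketch that compares exact matching against a prefix check. The function names and the in-memory data layout are hypothetical, and the prefix check is shown only to illustrate which groups are being missed; it is not a proposed implementation and would need an efficient query in practice.

```python
def exact_match_ids(namespaces, statistics_traversal_ids):
    """Current behavior: keep ids whose traversal_id appears verbatim."""
    return [ns_id for ns_id, t in namespaces if t in statistics_traversal_ids]


def prefix_match_ids(namespaces, statistics_traversal_ids):
    """Keep ids whose traversal_id is a prefix of any statistics traversal_id."""
    return [
        ns_id
        for ns_id, t in namespaces
        if any(s[:len(t)] == t for s in statistics_traversal_ids)
    ]


# Data from the example tables above: (id, traversal_id).
namespaces = [(20, (20,)), (30, (20, 30)), (40, (20, 30, 40))]
statistics = {(20, 30, 40)}

print(exact_match_ids(namespaces, statistics))   # [40]
print(prefix_match_ids(namespaces, statistics))  # [20, 30, 40]
```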
Edited by Gal Katz