Bug: Namespace vulnerability statistics schedule worker missing ids

Background:

Current Logic

The schedule worker uses somewhat odd logic to efficiently batch namespaces with the following considerations:

The namespaces table holds records for both projects and groups
~10% of the records in the table are groups
~1.5% of the groups have vulnerabilities
To avoid inefficient batching over all group ids, we want to check whether a group has vulnerabilities before sending it to the calculation service

High-level logic:

Iterate over namespaces in batches
For each batch, select the id and traversal_ids
Select all the traversal_ids from vulnerability_statistics that appeared in this batch
Filter the id values based on whether their matching traversal_ids was found in vulnerability_statistics
Aggregate filtered id values and schedule batch processing
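
The steps above can be sketched roughly as follows. This is a minimal in-memory illustration, not the worker's actual code: the function names (each_batch, ids_to_schedule), the batch size, and the sample rows are all hypothetical, and plain Python collections stand in for the namespaces and vulnerability_statistics tables.

```python
def each_batch(rows, batch_size):
    """Yield consecutive slices of rows (stand-in for batched iteration)."""
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]


def ids_to_schedule(namespaces, statistics_traversal_ids, batch_size=2):
    """Return one list of namespace ids per batch that has any match.

    namespaces: list of (id, traversal_ids) tuples (hypothetical shape).
    statistics_traversal_ids: set of traversal_ids tuples present in
    vulnerability_statistics (hypothetical shape).
    """
    scheduled = []
    for batch in each_batch(namespaces, batch_size):
        # Select the traversal_ids for this batch.
        batch_paths = {traversal_ids for _ns_id, traversal_ids in batch}
        # Keep only the traversal_ids that also appear in vulnerability_statistics.
        found = batch_paths & statistics_traversal_ids
        # Filter ids by exact traversal_ids match.
        filtered = [ns_id for ns_id, traversal_ids in batch if traversal_ids in found]
        # Aggregate the filtered ids for batch processing.
        if filtered:
            scheduled.append(filtered)
    return scheduled


# Made-up rows for illustration only.
sample_namespaces = [(1, (1,)), (2, (1, 2)), (3, (3,)), (4, (3, 4))]
sample_statistics = {(1, 2), (3,)}
print(ids_to_schedule(sample_namespaces, sample_statistics))  # [[2], [3]]
```

Note that the filter is an exact comparison of traversal_ids, so only namespaces whose full path appears in vulnerability_statistics are kept.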

The Problem

To avoid recalculating statistics over and over for the same group, we only take a group id if its exact traversal_ids value appears in vulnerability_statistics. As a result, we miss groups whose traversal_ids are a prefix of an existing record's traversal_ids, i.e. the ancestors of a group that has vulnerabilities.

Example:

vulnerability_statistics table:

id	project_id	total	critical	traversal_ids
1	30	5	5	{20,30,40}

namespaces table:

id	name	traversal_ids
20	group_20	{20}
30	group_30	{20,30}
40	group_40	{20,30,40}

In this case, we will only pass id = 40 to the batch processing, even though all of the groups should have their statistics calculated.
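
The example can be reproduced with a small sketch. It uses the rows from the two tables above as in-memory tuples. The helper names (exact_match_ids, is_prefix, prefix_match_ids) are hypothetical, and the prefix comparison is only one way to express the expected behaviour, not necessarily how the worker should be fixed.

```python
# Rows from the example tables: (id, traversal_ids).
NAMESPACES = [(20, (20,)), (30, (20, 30)), (40, (20, 30, 40))]
# traversal_ids values present in vulnerability_statistics.
STATISTICS_TRAVERSAL_IDS = [(20, 30, 40)]


def exact_match_ids(namespaces, statistics_paths):
    """Current behaviour: keep ids whose traversal_ids appear verbatim."""
    known = set(statistics_paths)
    return [ns_id for ns_id, path in namespaces if path in known]


def is_prefix(prefix, path):
    """True if prefix equals the leading elements of path."""
    return path[:len(prefix)] == prefix


def prefix_match_ids(namespaces, statistics_paths):
    """Expected behaviour: also keep ancestors of a matching record."""
    return [
        ns_id
        for ns_id, path in namespaces
        if any(is_prefix(path, stat_path) for stat_path in statistics_paths)
    ]


print(exact_match_ids(NAMESPACES, STATISTICS_TRAVERSAL_IDS))   # [40]
print(prefix_match_ids(NAMESPACES, STATISTICS_TRAVERSAL_IDS))  # [20, 30, 40]
```

The exact comparison returns only id = 40, while the prefix comparison also returns the ancestor groups 20 and 30.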

Edited Apr 22, 2025 by Gal Katz