Skip to content

Draft: Support group archive in analyzer namespace status adjustment service

What does this MR do and why?

Adds support for group archive in analyzer namespace status adjustment service.

Changelog: fixed
EE: true

The group archive feature adds a group-level archive flag that affects all descendant groups and projects. This breaks the existing logic in AnalyzerNamespaceStatuses::AdjustmentService, which aggregates project-level records based on the traversal_ids of the namespace in context.

To resolve this, I tried excluding project-level records that have an archived namespace_id anywhere in their traversal_ids array.

Example:

Consider this namespace structure where group [1,2] is archived and we should ignore the dotted projects [1,2,4], [1,2,5] when calculating the statistics for [1]:

graph TD
    A["[1]"]
    
    A --> B["[1,1]"]
    A --> C["[1,2] - Archived"]
    A --> D["[1,3]"]
    
    B --> E["[1,1,4]"]
    B --> F["[1,1,5]"]
    
    C --> G["[1,2,4]"]
    C --> H["[1,2,5]"]
    
    D --> I["[1,3,4]"]
    D --> J["[1,3,5]"]
    
    classDef dotted stroke-dasharray: 5 5
    class G,H dotted

The fix adds an array overlap check to the WHERE clause:

AND NOT (analyzer_project_statuses.traversal_ids && ARRAY[<ARCHIVED_NAMESPACE_IDS>]::bigint[])

Where ARCHIVED_NAMESPACE_IDS is all the traversal_ids of the namespaces in this batch.

Query and plan:

Query
WITH new_values AS (
    WITH namespace_data (
        namespace_id,
        traversal_ids,
        next_traversal_id
) AS (
        VALUES (98867842, ARRAY[9970, 96981785, 97058478, 98867842]::bigint[], ARRAY[9970, 96981785, 97058478, 98867843]::bigint[]))
        SELECT
            analyzer_project_statuses.analyzer_type,
            COALESCE(SUM((status = 1)::int), 0) AS success,
            COALESCE(SUM((status = 2)::int), 0) AS failure,
            namespace_data.traversal_ids AS traversal_ids,
            namespace_data.namespace_id AS namespace_id,
            now() AS created_at,
            now() AS updated_at
        FROM
            namespace_data
            LEFT JOIN analyzer_project_statuses ON analyzer_project_statuses.archived = FALSE
                AND analyzer_project_statuses.traversal_ids >= namespace_data.traversal_ids
                AND analyzer_project_statuses.traversal_ids < namespace_data.next_traversal_id
                AND NOT (analyzer_project_statuses.traversal_ids && ARRAY[231, 222, 96981785]::bigint[])
        GROUP BY
            namespace_data.traversal_ids,
            namespace_id,
            analyzer_project_statuses.analyzer_type
        HAVING
            count(analyzer_project_statuses.analyzer_type) > 0
),
old_values AS (
    SELECT
        namespace_id,
        traversal_ids,
        analyzer_type,
        success,
        failure
    FROM
        analyzer_namespace_statuses
    WHERE
        namespace_id IN (
            SELECT
                namespace_id
            FROM
                new_values)
),
upserted AS (
INSERT INTO analyzer_namespace_statuses (analyzer_type, success, failure, traversal_ids, namespace_id, created_at, updated_at) (
        SELECT
            analyzer_type,
            success,
            failure,
            traversal_ids,
            namespace_id,
            created_at,
            updated_at
        FROM
            new_values)
    ON CONFLICT (namespace_id,
        analyzer_type)
        DO UPDATE SET
            success = EXCLUDED.success, failure = EXCLUDED.failure, updated_at = EXCLUDED.updated_at, traversal_ids = EXCLUDED.traversal_ids
        RETURNING
            namespace_id
),
diff_values AS (
    SELECT
        new_values.namespace_id AS namespace_id,
        new_values.traversal_ids,
        new_values.analyzer_type,
        new_values.success - COALESCE(old_values.success, 0) AS success,
        new_values.failure - COALESCE(old_values.failure, 0) AS failure
    FROM
        new_values
        LEFT JOIN old_values ON new_values.namespace_id = old_values.namespace_id
            AND new_values.analyzer_type = old_values.analyzer_type
    WHERE
        EXISTS (
            SELECT
                1
            FROM
                upserted
            WHERE
                upserted.namespace_id = new_values.namespace_id))
SELECT
    *
FROM
    diff_values
WHERE
    success != 0
    OR failure != 0
plan

link

Nested Loop Semi Join  (cost=4.11..7.24 rows=1 width=54) (actual time=69.444..69.449 rows=0 loops=1)
   Buffers: shared hit=8 read=125 dirtied=7
   WAL: records=7 fpi=7 bytes=56355
   I/O Timings: read=68.373 write=0.000
   CTE new_values
     ->  Aggregate  (cost=3.61..3.65 rows=1 width=70) (actual time=69.442..69.444 rows=0 loops=1)
           Group Key: analyzer_project_statuses.analyzer_type
           Filter: (count(analyzer_project_statuses.analyzer_type) > 0)
           Rows Removed by Filter: 1
           Buffers: shared hit=8 read=125 dirtied=7
           WAL: records=7 fpi=7 bytes=56355
           I/O Timings: read=68.373 write=0.000
           ->  Sort  (cost=3.61..3.62 rows=1 width=4) (actual time=69.433..69.435 rows=1 loops=1)
                 Sort Key: analyzer_project_statuses.analyzer_type
                 Sort Method: quicksort  Memory: 25kB
                 Buffers: shared hit=8 read=125 dirtied=7
                 WAL: records=7 fpi=7 bytes=56355
                 I/O Timings: read=68.373 write=0.000
                 ->  Nested Loop Left Join  (cost=0.56..3.60 rows=1 width=4) (actual time=69.405..69.408 rows=1 loops=1)
                       Buffers: shared hit=5 read=125 dirtied=7
                       WAL: records=7 fpi=7 bytes=56355
                       I/O Timings: read=68.373 write=0.000
                       ->  Result  (cost=0.00..0.01 rows=1 width=0) (actual time=0.001..0.001 rows=1 loops=1)
                             I/O Timings: read=0.000 write=0.000
                       ->  Index Scan using index_analyzer_project_statuses_traversal_ids on public.analyzer_project_statuses  (cost=0.56..3.58 rows=1 width=4) (actual time=69.401..69.402 rows=0 loops=1)
                             Index Cond: ((analyzer_project_statuses.traversal_ids >= '{9970,96981785,97058478,98867842}'::bigint[]) AND (analyzer_project_statuses.traversal_ids < '{9970,96981785,97058478,98867843}'::bigint[]))
                             Filter: ((NOT analyzer_project_statuses.archived) AND (NOT (analyzer_project_statuses.traversal_ids && '{231,222,96981785}'::bigint[])))
                             Rows Removed by Filter: 162
                             Buffers: shared hit=5 read=125 dirtied=7
                             WAL: records=7 fpi=7 bytes=56355
                             I/O Timings: read=68.373 write=0.000
   CTE upserted
     ->  ModifyTable on public.analyzer_namespace_statuses analyzer_namespace_statuses_1  (cost=0.00..0.03 rows=1 width=82) (actual time=0.003..0.003 rows=0 loops=1)
           I/O Timings: read=0.000 write=0.000
           ->  CTE Scan on new_values new_values_2  (cost=0.00..0.03 rows=1 width=82) (actual time=0.001..0.001 rows=0 loops=1)
                 I/O Timings: read=0.000 write=0.000
   ->  Nested Loop Left Join  (cost=0.43..3.52 rows=1 width=70) (actual time=69.444..69.445 rows=0 loops=1)
         Filter: (((new_values.success - COALESCE(analyzer_namespace_statuses.success, '0'::bigint)) <> 0) OR ((new_values.failure - COALESCE(analyzer_namespace_statuses.failure, '0'::bigint)) <> 0))
         Rows Removed by Filter: 0
         Buffers: shared hit=8 read=125 dirtied=7
         WAL: records=7 fpi=7 bytes=56355
         I/O Timings: read=68.373 write=0.000
         ->  CTE Scan on new_values  (cost=0.00..0.02 rows=1 width=54) (actual time=69.443..69.444 rows=0 loops=1)
               Buffers: shared hit=8 read=125 dirtied=7
               WAL: records=7 fpi=7 bytes=56355
               I/O Timings: read=68.373 write=0.000
         ->  Nested Loop Semi Join  (cost=0.43..3.48 rows=1 width=26) (actual time=0.000..0.000 rows=0 loops=0)
               I/O Timings: read=0.000 write=0.000
               ->  Index Scan using index_analyzer_namespace_statuses_status on public.analyzer_namespace_statuses  (cost=0.43..3.45 rows=1 width=26) (actual time=0.000..0.000 rows=0 loops=0)
                     Index Cond: ((analyzer_namespace_statuses.namespace_id = new_values.namespace_id) AND (analyzer_namespace_statuses.analyzer_type = new_values.analyzer_type))
                     I/O Timings: read=0.000 write=0.000
               ->  CTE Scan on new_values new_values_1  (cost=0.00..0.02 rows=1 width=4) (actual time=0.000..0.000 rows=0 loops=0)
                     I/O Timings: read=0.000 write=0.000
   ->  CTE Scan on upserted  (cost=0.00..0.02 rows=1 width=8) (actual time=0.000..0.000 rows=0 loops=0)
         I/O Timings: read=0.000 write=0.000
Settings: random_page_cost = '1.5', work_mem = '100MB', effective_cache_size = '338688MB', jit = 'off', seq_page_cost = '4'

Note the addition of this filter condition:

AND (NOT (analyzer_project_statuses.traversal_ids && '{<FLATTENED_TRAVERSAL_IDS>}'::bigint[])))

Given a BATCH_SIZE of 100 namespaces and a maximum depth of 20 namespaces, is the performance impact of this array overlap check acceptable?

Related to Make Security::AnalyzersStatus::ProcessArchived... (#572476) • Gal Katz • Backlog

Edited by Gal Katz

Merge request reports

Loading