feat(query): denormalize node properties onto edge tables

Problem

Filtered aggregation queries hit a performance cliff on dense node tables. A query like:

{
  "nodes": [
    {"id": "pipe", "entity": "Pipeline", "filters": {"status": "failed"}},
    {"id": "proj", "entity": "Project", "node_ids": [278964]}
  ],
  "relationships": [{"type": "IN_PROJECT", "from": "pipe", "to": "proj"}],
  "aggregations": [{"function": "count", "target": "pipe"}]
}

generates a _nf_pipe CTE that scans the full gl_pipeline table (~4M rows for gitlab-org/gitlab), deduplicates via LIMIT 1 BY id, then feeds matching IDs into the edge scan via IN subquery. The node-table scan dominates at ~14s. This pattern affects all query types: traversals, aggregations, path finding, and neighbors.

Solution

Denormalize hot node properties onto edge tables as prefix-encoded tags in Array columns. When Pipeline.status lives on gl_edge as 'status:failed' inside source_tags, the compiler eliminates the _nf_ CTE entirely and filters directly on the edge table:

-- Before: ~14s
WITH _nf_pipe AS (
  SELECT id FROM gl_pipeline WHERE status = 5
  ORDER BY traversal_path, id, _version LIMIT 1 BY id
)
SELECT COUNT(e0.source_id) FROM gl_edge e0
WHERE e0.source_id IN (SELECT id FROM _nf_pipe)
  AND e0.target_id = {p0:Int64}

-- After: ~100-200ms
SELECT COUNT() FROM gl_edge e0
WHERE hasToken(e0.source_tags, {p0:String})
  AND e0.source_kind = 'Pipeline'
  AND e0.target_id = {p1:Int64}

Schema (ClickHouse 26.2+)

2 Array(LowCardinality(String)) columns with text(tokenizer='array') indexes:

source_tags Array(LowCardinality(String)),
target_tags Array(LowCardinality(String)),
INDEX source_tags_idx source_tags TYPE text(tokenizer = 'array') GRANULARITY 64,
INDEX target_tags_idx target_tags TYPE text(tokenizer = 'array') GRANULARITY 64,

Values are prefix-encoded: 'status:failed', 'state:merged', 'severity:critical'. The text index provides Roaring Bitmap posting lists with direct-read mode and granule-level pruning. Adding new properties is zero schema changes -- just new tag values.

How it works

Three layers, all ontology-driven:

1. Central denormalization list declares which properties to materialize:

# config/ontology/schema.yaml
settings:
  denormalization:
    - {node: Pipeline, property: status}
    - {node: MergeRequest, property: state}
    - {node: WorkItem, property: state}
    - {node: WorkItem, property: work_item_type, as: wi_type}
    - {node: Vulnerability, property: state}
    - {node: Vulnerability, property: severity}
    - {node: Job, property: status}
    - {node: Deployment, property: status}
    - {node: Finding, property: severity}

The optional as field overrides the tag key (default: property name). The ontology loader auto-derives source_tags/target_tags columns and text indexes from this list.

2. Indexer populates the arrays at edge-write time.

For FK edges, the parent node's properties are already in the MemTable. The lowering builds tagged array values with enum conversion:

-- FK edge transform (Pipeline → IN_PROJECT → Project):
make_array(
  concat('status:', CASE WHEN status = 0 THEN 'created'
    WHEN status = 4 THEN 'failed' ... ELSE CAST(status AS VARCHAR) END)
) AS source_tags

For standalone edges, endpoints declare which columns to enrich from the node's datalake table:

# config/ontology/edges/approved.yaml
etl:
  - scope: namespaced
    source: siphon_approvals
    order_by: [traversal_path, merge_request_id, id]
    from:
      id: user_id
      type: User
    to:
      id: merge_request_id
      type: MergeRequest
      enrich: [state_id]  # denorm: MR state

The enrich field triggers a deduplicating LEFT JOIN at extract time. The central denormalization config in schema.yaml remains the SSOT for what gets projected onto edge rows.

3. Compiler rewrites queries. The rewrite_denormalized_node_filters optimization pass identifies _nf_ CTEs where all filters reference denormalized properties, removes the CTE, and emits hasToken() filters on the edge array column:

-- Single filter:
WHERE hasToken(e0.source_tags, 'status:failed')

-- Multi-filter (same side, AND):
WHERE hasToken(e0.source_tags, 'state:detected')
  AND hasToken(e0.source_tags, 'severity:critical')

The pass handles all query types:

Query Type How the rewrite works
Traversal (1-hop) Removes _nf_ CTE, adds hasToken to outer WHERE
Traversal (multi-hop) Removes _nf_ CTE, injects hasToken into each UNION ALL arm
Aggregation Same as traversal; edge-only targets also get bare COUNT()
Aggregation GROUP BY Rewrites GROUP BY node.prop to GROUP BY edge tag column, prunes node JOIN
PathFinding Removes _nf_ anchor CTE, injects filter into frontier arms
Neighbors Per-direction hasToken filters for outgoing/incoming arms

Only eq filters are supported for denorm rewrites. Range/LIKE filters bail and keep the _nf_ CTE.

Adding a new denormalized property

  1. Add to the central list in schema.yaml:
denormalization:
  - {node: EntityName, property: property_name}
  1. For standalone edges that need it, add enrich to the relevant endpoint:
to:
  id: entity_id
  type: EntityName
  enrich: [source_column_name]
  1. Bump SCHEMA_VERSION, run mise schema:generate:ddl. No new columns, indexes, or projections needed.

Coverage

All properties write to source_tags / target_tags as prefix-encoded values:

Tag key Entities Example value
status Pipeline, Job, Deployment status:failed
state MergeRequest, WorkItem, Vulnerability state:merged
severity Vulnerability, Finding severity:critical
wi_type WorkItem wi_type:issue

FK edges: populated from parent node MemTable (zero overhead). Standalone edges with enrich: populated via deduplicating LEFT JOIN at extract time. Code edges: empty arrays.

Staleness

Denormalized values update when the edge is re-indexed. FK edges re-index atomically with their parent node (same extract batch). Standalone edges re-index when their join table receives a CDC event.

Schema

Bumps SCHEMA_VERSION to 18. Adds 2 Array(LowCardinality(String)) columns + 2 text(tokenizer='array') indexes to both gl_edge and gl_code_edge.

Edited by Michael Usachenko

Merge request reports

Loading