feat(query): denormalize node properties onto edge tables
Problem
Filtered aggregation queries hit a performance cliff on dense node tables. A query like:
{
"nodes": [
{"id": "pipe", "entity": "Pipeline", "filters": {"status": "failed"}},
{"id": "proj", "entity": "Project", "node_ids": [278964]}
],
"relationships": [{"type": "IN_PROJECT", "from": "pipe", "to": "proj"}],
"aggregations": [{"function": "count", "target": "pipe"}]
}generates a _nf_pipe CTE that scans the full gl_pipeline table (~4M rows for gitlab-org/gitlab), deduplicates via LIMIT 1 BY id, then feeds matching IDs into the edge scan via IN subquery. The node-table scan dominates at ~14s. This pattern affects all query types: traversals, aggregations, path finding, and neighbors.
Solution
Denormalize hot node properties onto edge tables as prefix-encoded tags in Array columns. When Pipeline.status lives on gl_edge as 'status:failed' inside source_tags, the compiler eliminates the _nf_ CTE entirely and filters directly on the edge table:
-- Before: ~14s
WITH _nf_pipe AS (
SELECT id FROM gl_pipeline WHERE status = 5
ORDER BY traversal_path, id, _version LIMIT 1 BY id
)
SELECT COUNT(e0.source_id) FROM gl_edge e0
WHERE e0.source_id IN (SELECT id FROM _nf_pipe)
AND e0.target_id = {p0:Int64}
-- After: ~100-200ms
SELECT COUNT() FROM gl_edge e0
WHERE hasToken(e0.source_tags, {p0:String})
AND e0.source_kind = 'Pipeline'
AND e0.target_id = {p1:Int64}Schema (ClickHouse 26.2+)
2 Array(LowCardinality(String)) columns with text(tokenizer='array') indexes:
source_tags Array(LowCardinality(String)),
target_tags Array(LowCardinality(String)),
INDEX source_tags_idx source_tags TYPE text(tokenizer = 'array') GRANULARITY 64,
INDEX target_tags_idx target_tags TYPE text(tokenizer = 'array') GRANULARITY 64,Values are prefix-encoded: 'status:failed', 'state:merged', 'severity:critical'. The text index provides Roaring Bitmap posting lists with direct-read mode and granule-level pruning. Adding new properties is zero schema changes -- just new tag values.
How it works
Three layers, all ontology-driven:
1. Central denormalization list declares which properties to materialize:
# config/ontology/schema.yaml
settings:
denormalization:
- {node: Pipeline, property: status}
- {node: MergeRequest, property: state}
- {node: WorkItem, property: state}
- {node: WorkItem, property: work_item_type, as: wi_type}
- {node: Vulnerability, property: state}
- {node: Vulnerability, property: severity}
- {node: Job, property: status}
- {node: Deployment, property: status}
- {node: Finding, property: severity}The optional as field overrides the tag key (default: property name). The ontology loader auto-derives source_tags/target_tags columns and text indexes from this list.
2. Indexer populates the arrays at edge-write time.
For FK edges, the parent node's properties are already in the MemTable. The lowering builds tagged array values with enum conversion:
-- FK edge transform (Pipeline → IN_PROJECT → Project):
make_array(
concat('status:', CASE WHEN status = 0 THEN 'created'
WHEN status = 4 THEN 'failed' ... ELSE CAST(status AS VARCHAR) END)
) AS source_tagsFor standalone edges, endpoints declare which columns to enrich from the node's datalake table:
# config/ontology/edges/approved.yaml
etl:
- scope: namespaced
source: siphon_approvals
order_by: [traversal_path, merge_request_id, id]
from:
id: user_id
type: User
to:
id: merge_request_id
type: MergeRequest
enrich: [state_id] # denorm: MR stateThe enrich field triggers a deduplicating LEFT JOIN at extract time. The central denormalization config in schema.yaml remains the SSOT for what gets projected onto edge rows.
3. Compiler rewrites queries. The rewrite_denormalized_node_filters optimization pass identifies _nf_ CTEs where all filters reference denormalized properties, removes the CTE, and emits hasToken() filters on the edge array column:
-- Single filter:
WHERE hasToken(e0.source_tags, 'status:failed')
-- Multi-filter (same side, AND):
WHERE hasToken(e0.source_tags, 'state:detected')
AND hasToken(e0.source_tags, 'severity:critical')The pass handles all query types:
| Query Type | How the rewrite works |
|---|---|
| Traversal (1-hop) | Removes _nf_ CTE, adds hasToken to outer WHERE |
| Traversal (multi-hop) | Removes _nf_ CTE, injects hasToken into each UNION ALL arm |
| Aggregation | Same as traversal; edge-only targets also get bare COUNT() |
| Aggregation GROUP BY | Rewrites GROUP BY node.prop to GROUP BY edge tag column, prunes node JOIN |
| PathFinding | Removes _nf_ anchor CTE, injects filter into frontier arms |
| Neighbors | Per-direction hasToken filters for outgoing/incoming arms |
Only eq filters are supported for denorm rewrites. Range/LIKE filters bail and keep the _nf_ CTE.
Adding a new denormalized property
- Add to the central list in
schema.yaml:
denormalization:
- {node: EntityName, property: property_name}- For standalone edges that need it, add
enrichto the relevant endpoint:
to:
id: entity_id
type: EntityName
enrich: [source_column_name]- Bump
SCHEMA_VERSION, runmise schema:generate:ddl. No new columns, indexes, or projections needed.
Coverage
All properties write to source_tags / target_tags as prefix-encoded values:
| Tag key | Entities | Example value |
|---|---|---|
status |
Pipeline, Job, Deployment | status:failed |
state |
MergeRequest, WorkItem, Vulnerability | state:merged |
severity |
Vulnerability, Finding | severity:critical |
wi_type |
WorkItem | wi_type:issue |
FK edges: populated from parent node MemTable (zero overhead). Standalone edges with enrich: populated via deduplicating LEFT JOIN at extract time. Code edges: empty arrays.
Staleness
Denormalized values update when the edge is re-indexed. FK edges re-index atomically with their parent node (same extract batch). Standalone edges re-index when their join table receives a CDC event.
Schema
Bumps SCHEMA_VERSION to 18. Adds 2 Array(LowCardinality(String)) columns + 2 text(tokenizer='array') indexes to both gl_edge and gl_code_edge.