fix(indexer): bound system_notes backfill join to the page to avoid datalake OOM
[skip schema-version-check]
What does this MR do and why?
Bounds the SystemNote extract query's metadata join to the page to prevent datalake OOM on large namespaces.
The SystemNote extract query joined siphon_notes to siphon_system_note_metadata inline, but only the notes side was namespace-scoped; the metadata side was unbounded. ClickHouse builds the hash table over the entire siphon_system_note_metadata table (FillingRightJoinSide) before ORDER BY/LIMIT/streaming, so peak server memory scaled with the global table — crossing the 108 GiB ceiling on gitlab-org.
Fix: Separate the base table scan from the metadata enrichment using the existing _batch CTE pattern (same as standalone-edge enrichment). The base query (siphon_notes) is wrapped in a _batch CTE with namespace/watermark/cursor/LIMIT scoping. The metadata table is then enriched via an _e0 CTE scoped to note_id IN (SELECT DISTINCT id FROM _batch), so ClickHouse never materializes a hash table over the full metadata table.
Before (unbounded INNER JOIN above LIMIT):
SELECT sn.*, snm.action
FROM siphon_notes AS sn
INNER JOIN siphon_system_note_metadata AS snm ON sn.id = snm.note_id
WHERE startsWith(sn.traversal_path, ...) AND ...
ORDER BY traversal_path, id LIMIT {batch_size}After (page-bounded CTE enrichment):
WITH _batch AS (
SELECT sn.* FROM siphon_notes AS sn
WHERE startsWith(sn.traversal_path, ...) AND ...
ORDER BY traversal_path, id LIMIT {batch_size}
), _e0 AS (
SELECT snm.note_id AS id, argMax(snm.action, snm._siphon_replicated_at) AS action
FROM siphon_system_note_metadata AS snm
WHERE snm.note_id IN (SELECT DISTINCT id FROM _batch)
AND snm._siphon_deleted = false
GROUP BY snm.note_id
)
SELECT _batch.*, _e0.action FROM _batch LEFT JOIN _e0 ON _batch.id = _e0.idSchema version check skip justification
This change is query-execution-shape only — the emitted system-note edges are functionally identical (argMax dedup is equivalent-or-more-correct than the old un-FINAL INNER JOIN). No materialized graph data changes, so no reindex/migration is warranted. A SCHEMA_VERSION bump would trigger the schema-version readiness gate + migration orchestrator → a full reindex/backfill, which is the exact heavy datalake operation this MR exists to make survivable.
Context: system_notes is currently disabled in orbit-prd (argocd/apps!2196) precisely because of this OOM; that MR's stated follow-up is "optimize the systemnote datalake query before re-enabling." This MR IS that follow-up. Re-enabling will run the (now-bounded) backfill under the existing schema version.
Related Issues
Closes #830 (closed)
Testing
Unit / query-shape tests:
- Updated
build_plans_wires_system_note_derived_entity_as_extract_only_planto assert the bounded CTE shape (no INNER JOIN above LIMIT,note_id IN (SELECT DISTINCT id FROM _batch)present). - Added
system_note_extract_bounds_metadata_join_to_pageasserting:_batchCTE wraps only the basesiphon_notesscan, enrichment CTE scopes metadata to the page,LEFT JOIN _e0projectsaction. every_plan_renders_valid_sqlcontinues to pass for all plans.cargo test -p indexer --lib: 367 tests pass.cargo test -p ontology --lib: 118 tests pass.cargo clippy -p indexer -p ontology --all-features -- -D warnings: zero warnings.cargo fmt -- --check: clean.- Ontology schema validation (
check-jsonschema): all YAML files pass.
Smaller-scale ClickHouse validation (testcontainer):
Seeded a ClickHouse container with 150 notes (100 in namespace 1/2/, 50 in 3/4/) and 1,150 metadata rows (150 matching + 1,000 orphan global rows). Ran old vs new queries with LIMIT 20:
| Metric | OLD (INNER JOIN) | NEW (CTE bounded) |
|---|---|---|
read_rows |
1,250 | 300 |
read_bytes |
39,314 | 14,155 |
result_rows |
20 | 20 |
| Results | Identical | Identical |
Prod-scale 108 GiB repro isn't possible locally. The bound holds at scale because the _batch CTE limits the base scan to batch_size rows before any join, and the _e0 CTE reads only metadata matching the page's note IDs — memory scales with the page, not the global table.
Performance Analysis
- This merge request does not introduce any performance regression. If a performance regression is expected, explain why.
This MR strictly improves performance: the metadata join build side shrinks from the entire global siphon_system_note_metadata table to at most batch_size rows.
Agent context — long-form analysis, file-by-file walkthroughs, profiler output, alternatives considered
Root cause analysis (from #830 (closed)):
The SystemNote derived entity used EtlConfig::Query with from: "siphon_notes AS sn INNER JOIN siphon_system_note_metadata AS snm ON sn.id = snm.note_id". This inlined the metadata join above the LIMIT, so ClickHouse's hash-join planner builds the hash table over the entire right side (siphon_system_note_metadata) in FillingRightJoinSide before producing any output rows. The ORDER BY/LIMIT/cursor pagination only bound the output, not the input.
Fix options considered:
- Option A (chosen): Page-bounded CTE. Wraps the base scan in a
_batchCTE with all scoping inside, then enriches viaIN (SELECT id FROM _batch). Reuses the existingEnrichmentSql/render_cte_templatepattern. - Option B: Range-partition. Doesn't bound the metadata build.
- Option C: Spill/memory settings. Targets sort/group-by, not hash-join build. Dropped from this MR after review — the OOM was structural, not a memory-cap issue.
- Option D: ClickHouse dictionaries (#828). Follow-up.
Changes by file:
config/ontology/derived/core/system_note.yaml— Splitfromto baresiphon_notes AS sn, addedpage_joinconfig for bounded metadata enrichment.crates/ontology/src/etl.rs— AddedPageJoinstruct with configurablewatermarkfield.crates/ontology/src/loading/node.rs— AddedPageJoinYamldeserialization.config/schemas/ontology.schema.json— Addedpage_joinobject schema.crates/indexer/src/modules/sdlc/plan/input.rs— SynthesizesEnrichmentSqlfrompage_join.crates/indexer/src/modules/sdlc/plan/lower.rs— Added bounded join shape test.crates/indexer/src/modules/sdlc/mod.rs— Updated SystemNote plan test.
%{all_commits}