fix(indexer): bound system_notes backfill join to the page to avoid datalake OOM

[skip schema-version-check]

What does this MR do and why?

Bounds the SystemNote extract query's metadata join to the page to prevent datalake OOM on large namespaces.

The SystemNote extract query joined siphon_notes to siphon_system_note_metadata inline, but only the notes side was namespace-scoped; the metadata side was unbounded. ClickHouse builds the hash table over the entire siphon_system_note_metadata table (FillingRightJoinSide) before ORDER BY/LIMIT/streaming, so peak server memory scaled with the global table — crossing the 108 GiB ceiling on gitlab-org.

Fix: Separate the base table scan from the metadata enrichment using the existing _batch CTE pattern (same as standalone-edge enrichment). The base query (siphon_notes) is wrapped in a _batch CTE with namespace/watermark/cursor/LIMIT scoping. The metadata table is then enriched via an _e0 CTE scoped to note_id IN (SELECT DISTINCT id FROM _batch), so ClickHouse never materializes a hash table over the full metadata table.

Before (unbounded INNER JOIN above LIMIT):

SELECT sn.*, snm.action
FROM siphon_notes AS sn
INNER JOIN siphon_system_note_metadata AS snm ON sn.id = snm.note_id
WHERE startsWith(sn.traversal_path, ...) AND ...
ORDER BY traversal_path, id LIMIT {batch_size}

After (page-bounded CTE enrichment):

WITH _batch AS (
  SELECT sn.* FROM siphon_notes AS sn
  WHERE startsWith(sn.traversal_path, ...) AND ...
  ORDER BY traversal_path, id LIMIT {batch_size}
), _e0 AS (
  SELECT snm.note_id AS id, argMax(snm.action, snm._siphon_replicated_at) AS action
  FROM siphon_system_note_metadata AS snm
  WHERE snm.note_id IN (SELECT DISTINCT id FROM _batch)
    AND snm._siphon_deleted = false
  GROUP BY snm.note_id
)
SELECT _batch.*, _e0.action FROM _batch LEFT JOIN _e0 ON _batch.id = _e0.id

Schema version check skip justification

This change is query-execution-shape only — the emitted system-note edges are functionally identical (argMax dedup is equivalent-or-more-correct than the old un-FINAL INNER JOIN). No materialized graph data changes, so no reindex/migration is warranted. A SCHEMA_VERSION bump would trigger the schema-version readiness gate + migration orchestrator → a full reindex/backfill, which is the exact heavy datalake operation this MR exists to make survivable.

Context: system_notes is currently disabled in orbit-prd (argocd/apps!2196) precisely because of this OOM; that MR's stated follow-up is "optimize the systemnote datalake query before re-enabling." This MR IS that follow-up. Re-enabling will run the (now-bounded) backfill under the existing schema version.

Closes #830 (closed)

Testing

Unit / query-shape tests:

  • Updated build_plans_wires_system_note_derived_entity_as_extract_only_plan to assert the bounded CTE shape (no INNER JOIN above LIMIT, note_id IN (SELECT DISTINCT id FROM _batch) present).
  • Added system_note_extract_bounds_metadata_join_to_page asserting: _batch CTE wraps only the base siphon_notes scan, enrichment CTE scopes metadata to the page, LEFT JOIN _e0 projects action.
  • every_plan_renders_valid_sql continues to pass for all plans.
  • cargo test -p indexer --lib: 367 tests pass. cargo test -p ontology --lib: 118 tests pass.
  • cargo clippy -p indexer -p ontology --all-features -- -D warnings: zero warnings. cargo fmt -- --check: clean.
  • Ontology schema validation (check-jsonschema): all YAML files pass.

Smaller-scale ClickHouse validation (testcontainer): Seeded a ClickHouse container with 150 notes (100 in namespace 1/2/, 50 in 3/4/) and 1,150 metadata rows (150 matching + 1,000 orphan global rows). Ran old vs new queries with LIMIT 20:

Metric OLD (INNER JOIN) NEW (CTE bounded)
read_rows 1,250 300
read_bytes 39,314 14,155
result_rows 20 20
Results Identical Identical

Prod-scale 108 GiB repro isn't possible locally. The bound holds at scale because the _batch CTE limits the base scan to batch_size rows before any join, and the _e0 CTE reads only metadata matching the page's note IDs — memory scales with the page, not the global table.

Performance Analysis

  • This merge request does not introduce any performance regression. If a performance regression is expected, explain why.

This MR strictly improves performance: the metadata join build side shrinks from the entire global siphon_system_note_metadata table to at most batch_size rows.

Agent context — long-form analysis, file-by-file walkthroughs, profiler output, alternatives considered

Root cause analysis (from #830 (closed)):

The SystemNote derived entity used EtlConfig::Query with from: "siphon_notes AS sn INNER JOIN siphon_system_note_metadata AS snm ON sn.id = snm.note_id". This inlined the metadata join above the LIMIT, so ClickHouse's hash-join planner builds the hash table over the entire right side (siphon_system_note_metadata) in FillingRightJoinSide before producing any output rows. The ORDER BY/LIMIT/cursor pagination only bound the output, not the input.

Fix options considered:

  • Option A (chosen): Page-bounded CTE. Wraps the base scan in a _batch CTE with all scoping inside, then enriches via IN (SELECT id FROM _batch). Reuses the existing EnrichmentSql / render_cte_template pattern.
  • Option B: Range-partition. Doesn't bound the metadata build.
  • Option C: Spill/memory settings. Targets sort/group-by, not hash-join build. Dropped from this MR after review — the OOM was structural, not a memory-cap issue.
  • Option D: ClickHouse dictionaries (#828). Follow-up.

Changes by file:

  1. config/ontology/derived/core/system_note.yaml — Split from to bare siphon_notes AS sn, added page_join config for bounded metadata enrichment.
  2. crates/ontology/src/etl.rs — Added PageJoin struct with configurable watermark field.
  3. crates/ontology/src/loading/node.rs — Added PageJoinYaml deserialization.
  4. config/schemas/ontology.schema.json — Added page_join object schema.
  5. crates/indexer/src/modules/sdlc/plan/input.rs — Synthesizes EnrichmentSql from page_join.
  6. crates/indexer/src/modules/sdlc/plan/lower.rs — Added bounded join shape test.
  7. crates/indexer/src/modules/sdlc/mod.rs — Updated SystemNote plan test.

%{all_commits}

Edited by Dmitry Gruzd

Merge request reports

Loading