system_notes backfill OOMs the datalake on large namespaces (FillingRightJoinSide join) (#830) · Issues · GitLab.org / orbit / GitLab Orbit

system_notes backfill OOMs the datalake on large namespaces (FillingRightJoinSide join)

### Summary With system_notes enabled in orbit-prd, the SystemNote backfill runs a server-side join of siphon_notes to siphon_system_note_metadata scoped to a whole top-level namespace. On the largest namespaces that join exceeds the datalake ClickHouse 108 GiB query-memory limit and terminates SDLC indexing messages. ### Expected behavior The system-notes backfill stays within the datalake query-memory budget for every namespace, so the schema migration finishes with system-note edges materialized everywhere. ### Actual behavior During the 0.71.0 / schema v56 migration backfill with system_notes on, the systemnote handler fails on the datalake with `Code: 241 MEMORY_LIMIT_EXCEEDED` against the 108 GiB ceiling. Client page-streaming shrinks the read block (None down to 25k), but the join and sort run on the datalake before any rows stream, so the retries never bound server memory. They exhaust, and the namespace message is terminated (handler `max_attempts: 1`, dead-letter on exhaustion). The largest namespaces' system notes never reach v56, and the heavy query also starved the shared datalake into collateral memory failures on unrelated tables (merge_requests, siphon_environments). ### Steps to reproduce 1. Enable system_notes in orbit-prd (chart >= 1.1.0, image >= 0.70.0). 2. Run a full SDLC backfill (schema migration) over a dense top-level namespace such as gitlab-org. 3. Watch the systemnote handler fail with a memory-limit error while paging the notes join. ### Environment - Cluster: orbit-prd - Gkg version: 0.71.0 (helm chart 1.1.0) - Schema version: v56 (migrating at the time; v55 active) - Datalake ClickHouse: 25.12, query-memory limit 108 GiB ### Logs <details><summary>Expand logs</summary> ``` ERROR handler failed handler="entity.systemnote" target="indexer::engine" error: processing failed: datalake stream failed: query failed: query error: bad response: Code: 241. DB::Exception: (total) memory limit exceeded: would use 112.38 GiB ... maximum: 108.00 GiB ... reading ... siphon_system_note_metadata ... While executing FillingRightJoinSide. (MEMORY_LIMIT_EXCEEDED) WARN retry attempts exhausted attempt=1 max_attempts=1 topic="GKG_INDEXER.sdlc.namespace.indexing.requested.*.*" ``` One retry attempted a 32 GiB chunk allocation inside `FillingRightJoinSide`. </details> ### Workaround Disabled system_notes in orbit-prd via argocd-apps !2196, keeping image 0.71.0 and chart 1.1.0. The ArgoCD indexer rolling restart cleared it: datalake memory-limit errors dropped from 96 to ~0 and the systemnote handler is no longer registered. Staging keeps system_notes on for the fix. ### Additional context The resolver lookups in resolve.rs are already bounded (top-level `startsWith(traversal_path, ...)` range scan, chunked params after #827 / !1557). The unbounded path is the source query in `config/ontology/derived/core/system_note.yaml`: its `from` join plus `order_by: [traversal_path, id]` materializes and sorts a whole namespace's system notes on the datalake. Same dense-namespace scaling class as #827 (uri too long) and #828 (dictionary resolver). cc @dgruzd Candidate directions: chunk the source scan by sub-namespace (project-level traversal_path) instead of the whole top-level; let the datalake query spill (`max_bytes_before_external_sort`, `max_bytes_before_external_group_by`, or a per-query `max_memory_usage`); or move resolution onto dictionaries (#828) and drop the server-side join. Related: #499 (umbrella), #827, #828, !1529, !1557, !1559, ADR-013 (!1336). Distinct from #794, which was indexer pod OOM rather than the datalake. <details> <summary><b>Agent context</b> — extended analysis, evidence, rollout timeline</summary> Rollout timeline (2026-06-05, UTC). argocd-apps !2194 bumped gkg 0.70.1 to 0.71.0 (stg+prd) and enabled system_notes in prd (chart 1.0.0 to 1.1.0). The 0.71.0 image triggered the v55 to v56 schema migration. During the v56 backfill the systemnote handler hit the datalake memory ceiling. !2196 removed the flag; the indexer Deployments rolled (config checksum `4101b0fb` to `47cdd3a7`) at 23:15:42 and resumed the backfill from `v56_checkpoint` / `v56_code_indexing_checkpoint` with no progress lost (per-table row counts kept climbing across the restart). Before and after the disable, from Kibana `pubsub-orbit-inf-gprd*` (filter on `json.jsonPayload.level.keyword`): | signal | pre-disable (23:00–23:15:42) | post-disable (23:15:42+) | |---|---|---| | MEMORY_LIMIT_EXCEEDED | 96 | 1 (in-flight residual) | | systemnote handler-failed | 85 | 2 (in-flight residual) | | collateral SDLC entities (project, job, finding, vuln, ...) | ~40 across 20 handlers | ~0 | Why page-streaming did not help: the "datalake page failed, retrying with smaller block size" path (#798) bounds the client read block, but `FillingRightJoinSide` builds the join's right side and `ORDER BY traversal_path, id` sorts on the datalake server before rows stream out. Shrinking the block size leaves that server working set untouched, so retries exhausted at block size 25k. Source query, from `config/ontology/derived/core/system_note.yaml`: ``` from: siphon_notes AS sn INNER JOIN siphon_system_note_metadata AS snm ON sn.id = snm.note_id where: sn.system = true AND sn._siphon_deleted = false AND snm._siphon_deleted = false traversal_path_filter: startsWith(sn.traversal_path, {traversal_path:String}) order_by: [traversal_path, id] ``` Separate, minor signal, not this bug: a few v56_gl_edge write failures (entity.workitem, entity.mergerequestdifffile, 3 total post-disable, roughly 1 per 2 minutes, not escalating) with `Code: 209 ... Timeout ... socket` / `SendRequest`. Those are graph-side write timeouts under heavy backfill load, retried, and unrelated to the datalake memory problem. Flagging for awareness only. </details>

issue