Query pagination: redaction breaks LIMIT/OFFSET, need cursor support
## Summary

The unified response schema ([snippet 5965394](https://gitlab.com/gitlab-org/orbit/knowledge-graph/-/snippets/5965394)) defines a single `{ columns, nodes, edges }` JSON shape for all five query types. Pagination was left out of that design. This issue documents how the current LIMIT/OFFSET mechanism works, why redaction breaks it, and what we'd need to change.

## Current state

The query engine supports offset-based pagination through two input fields:

- `limit: u32` (default 30) maps to SQL `LIMIT`
- `range: Option<{start, end}>` converts to `LIMIT (end-start) OFFSET start` in `lower.rs:46-55`

SQL generation in `codegen.rs:214-220` emits `LIMIT N OFFSET M` on the final query. This works fine at the ClickHouse level, but redaction makes it unreliable for the frontend.

### Per-query-type behavior

| Query type | How LIMIT applies | Notes |
|---|---|---|
| Search | `LIMIT/OFFSET` on single-table query | |
| Traversal (single-hop) | `LIMIT/OFFSET` on JOIN result | |
| Traversal (multi-hop) | `LIMIT/OFFSET` after `UNION ALL` of all hop depths | Combined result, not per-subquery |
| Aggregation | `LIMIT/OFFSET` on `GROUP BY` result | |
| Path finding | Hard CTE limit of 1000 paths internally (`lower.rs:229`), then user's `LIMIT/OFFSET` on outer query | Two-stage limiting |
| Neighbors | `LIMIT/OFFSET` on neighbor join result | |

## The redaction problem

The pipeline runs in this order:

```
ClickHouse (LIMIT 20) → 20 rows → Extraction → Authorization (Rails gRPC) → Redaction → N authorized rows (N <= 20)
```

If we request 20 rows and 5 fail authorization, the frontend gets 15. The pipeline doesn't go back for more. `row_count` in the proto response (`QueryMetadata.row_count`) is the post-redaction count. `redacted_count` exists internally in `PipelineOutput` but isn't exposed to the client (redaction is invisible to consumers, per the unified schema spec).

This makes offset-based pagination unreliable.
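As a concrete illustration, here is a toy simulation of the shortfall (not engine code — the row IDs and the authorization predicate are made up; real rows go through the Rails gRPC check):

```rust
/// Models ClickHouse `LIMIT limit OFFSET offset` over a pre-sorted result set.
fn fetch_page(all_rows: &[u32], limit: usize, offset: usize) -> Vec<u32> {
    all_rows.iter().skip(offset).take(limit).copied().collect()
}

/// Models the redaction stage: unauthorized rows are dropped, fail-closed.
fn redact(rows: Vec<u32>, is_authorized: impl Fn(u32) -> bool) -> Vec<u32> {
    rows.into_iter().filter(|&r| is_authorized(r)).collect()
}
```

With 100 rows where every 4th fails authorization, `redact(fetch_page(&rows, 20, 0), |r| r % 4 != 0)` yields 15 rows rather than the 20 requested, and nothing in the current pipeline goes back for more.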
Page 2 at `OFFSET 20` paginates over ClickHouse's pre-redaction row positions, not over the rows the frontend actually received, so page sizes vary unpredictably and a short page is indistinguishable from the end of the result set. The frontend can't detect or compensate for this.

### Relevant code paths

- Redaction marking: `redaction/query_result.rs:273-314`, `apply_authorizations()` marks rows via `set_unauthorized()`
- Row counting: `stages/formatting.rs:24`, `authorized_count()` filters to authorized rows only
- Proto output: `grpc/service.rs:156`, `row_count` is post-redaction
- Fail-closed policy: any row with an unverifiable node is marked unauthorized (`query_result.rs:286, 341`)

## What's missing in the proto

Current `QueryMetadata` (`gkg.proto:70-82`):

```protobuf
message QueryMetadata {
  string query_type = 1;
  repeated string raw_query_strings = 2;
  int32 row_count = 3;
}
```

No cursor, no `has_more` flag, no total count. The frontend can't paginate.

## Existing cursor pattern in the codebase

The indexer already does cursor-based pagination in `crates/indexer/src/modules/sdlc/plan/mod.rs:14-116`. It builds composite-key cursors with DNF (disjunctive normal form) WHERE conditions:

```
(c1 > v1) OR (c1 = v1 AND c2 > v2)
```

This avoids the performance cost of large OFFSETs on ClickHouse and could be adapted for the query engine.

## Design options

### Option A: Over-fetch and trim

Request more rows than needed from ClickHouse (e.g., `LIMIT 30` when the frontend wants 20). After redaction, return the first 20 authorized rows and encode the last consumed ClickHouse row's position as a cursor. Single round-trip per page.

The downside is that the over-fetch ratio is unpredictable. In high-redaction scenarios, even 2x may not be enough. The factor could be derived from historical redaction rates per namespace.

### Option B: Iterative backfill

Request `LIMIT 20`, authorize, and if fewer than 20 rows survive redaction, issue another query with `OFFSET 20` for the next batch. Repeat until the page is full or results are exhausted.
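A sketch of that backfill loop, assuming a hypothetical fetch callback (the signature is illustrative, not the engine's actual API; the callback stands in for one `LIMIT/OFFSET` query plus the authorization stage):

```rust
/// Iterative backfill: keep fetching fixed-size batches at increasing
/// offsets until the page is full or the source runs out.
///
/// `fetch(limit, offset)` returns the authorized survivors of one batch
/// and a flag saying whether the underlying result set is exhausted.
fn backfill_page(
    page_size: usize,
    batch_size: usize,
    mut fetch: impl FnMut(usize, usize) -> (Vec<u64>, bool),
) -> Vec<u64> {
    let mut page = Vec::with_capacity(page_size);
    let mut offset = 0;
    loop {
        let (authorized, exhausted) = fetch(batch_size, offset);
        page.extend(authorized);
        offset += batch_size;
        if page.len() >= page_size || exhausted {
            page.truncate(page_size);
            return page;
        }
    }
}
```

Every loop iteration is another ClickHouse round-trip, which is where the latency concern with this option comes from.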
This always delivers the requested page size (when possible), but it requires multiple ClickHouse round-trips per page, with latency spikes when redaction rates are high.

### Option C: Pre-authorized result cache

Run the query with a large limit (e.g., 1000), authorize all rows, cache the authorized result set server-side with a session ID, and serve pages from the cache.

Page navigation is fast after the initial load and page sizes are predictable. But it adds memory pressure, cache-invalidation complexity, and server-side state.

### Recommendation

Option A is the most practical starting point. Start with a 1.5x over-fetch factor and tune it later. For the initial implementation, falling short of the requested page size is fine: the `has_more` flag tells the frontend there are more results, and it can request the next page.

## Proposed proto changes

```protobuf
message QueryMetadata {
  string query_type = 1;
  repeated string raw_query_strings = 2;
  int32 row_count = 3;
  string cursor = 4;  // opaque, encodes ClickHouse offset + sort key
  bool has_more = 5;  // true if more results exist beyond this page
}
```

No `total_count`: that would require a separate `SELECT count(*)`, which is expensive on ClickHouse for filtered queries and would double query load.

The cursor is opaque to consumers. Internally it encodes enough state to resume the query: an offset, sort-key values, or a keyset, depending on the approach.

## Interaction with the unified response schema

The `{ columns, nodes, edges }` shape works well with pagination:

- Nodes are deduplicated, so paginating by rows and deduplicating nodes per page works naturally
- The frontend appends nodes/edges from subsequent pages to the existing graph
- For table view, each page adds rows to the stacked entity tables
- `has_more` controls the "Load more" button

## Open questions

1. Should cursor-based pagination use keyset conditions (WHERE-based, like the indexer) or offsets (simpler, but slower for deep pages)?
2. For path finding with the 1000-path CTE hard limit, should pagination apply to paths or to the flattened node/edge result?
3. Should the over-fetch factor be configurable per namespace or global?
4. How should `has_more` be determined? Requesting `LIMIT N+1` and checking whether the extra row exists is one option.
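For reference, the keyset condition from question 1 and the `N+1` probe from question 4 can be sketched together. This is a sketch under assumptions: the function names are hypothetical, and column names/values are assumed to be pre-sanitized or parameterized before reaching SQL.

```rust
/// Builds the DNF keyset condition the indexer pattern uses. For cursor
/// values (v1, v2) over sort columns (c1, c2) this produces:
///   (c1 > v1) OR (c1 = v1 AND c2 > v2)
fn keyset_where(columns: &[&str], cursor: &[&str]) -> String {
    let mut disjuncts = Vec::new();
    for i in 0..columns.len() {
        // Equality on all columns before position i, strict inequality at i.
        let mut conjuncts: Vec<String> = (0..i)
            .map(|j| format!("{} = {}", columns[j], cursor[j]))
            .collect();
        conjuncts.push(format!("{} > {}", columns[i], cursor[i]));
        disjuncts.push(format!("({})", conjuncts.join(" AND ")));
    }
    disjuncts.join(" OR ")
}

/// The `LIMIT N+1` probe: the query fetches one extra row; if it exists,
/// report `has_more` and return only the first N rows to the client.
fn trim_page(mut rows: Vec<u64>, page_size: usize) -> (Vec<u64>, bool) {
    let has_more = rows.len() > page_size;
    rows.truncate(page_size);
    (rows, has_more)
}
```

`keyset_where(&["c1", "c2"], &["v1", "v2"])` reproduces the `(c1 > v1) OR (c1 = v1 AND c2 > v2)` shape from the indexer section above.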