[kuzu][deployed] spike: handling large text fields (like description)

Problem Statement

Kuzu’s execution engine materializes every property on a node once the Cypher projection requests that node (e.g. RETURN n). Internally, a STRING value is stored as a 16-byte ku_string_t header with an overflow pointer for long payloads (src/include/common/types/ku_string.h in [@https://github.com/kuzudb/kuzu]). During projection, the executor follows that pointer, streams the overflow pages from OverflowFile, and serializes the whole byte sequence back to the client. That behaviour is correct, but in our GitLab SDLC benchmark (feat: gitlab sdlc graph performance analysis (!292)) we intentionally generated half-megabyte descriptions for ~5% of issues and merge requests (GitLabDataGenerator::maybe_long_text in the Rust harness). Consequently, any query that returned Issue or MergeRequest wholesale had to ship hundreds of kilobytes per row, increasing median latency from ~100 ms (metadata only) to ~18 seconds, even when filters were fully indexed.

Why “fetch-on-demand” inside the same statement fails today

  • No nested LOAD FROM. The binder rejects LOAD FROM inside subqueries (bind_load_from.cpp throws "LOAD FROM subquery is not supported." in [@https://github.com/kuzudb/kuzu]). We cannot scan external Parquet/Delta bodies inside the same Cypher pipeline.
  • Projection is all-or-nothing. The logical/physical planners treat RETURN n as a request for the entire struct, so there is no opportunity to skip description if the node literal is returned.
  • Indexes only solve lookup. Full-text or secondary indices will speed up WHERE, but the cost of returning 512 KB strings remains once the filter succeeds.
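
For illustration, the rejected shape looks roughly like the sketch below. The file name, column list, and the EXISTS-subquery framing are hypothetical; the point is only that LOAD FROM works at the top level of a statement but cannot be nested under a MATCH:

```cypher
// Top-level LOAD FROM is supported: scan an external Parquet file directly
// (the file name 'issue_bodies.parquet' is hypothetical).
LOAD FROM 'issue_bodies.parquet' RETURN id, body LIMIT 5;

// Nested form the binder rejects with
// "LOAD FROM subquery is not supported.":
// stitching external bodies into a MATCH in the same statement.
MATCH (i:Issue)
WHERE EXISTS {
    LOAD FROM 'issue_bodies.parquet'
    WHERE id = i.id
    RETURN *
}
RETURN i.id, i.title;
```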

Context: observed impact

| Query shape | Payload | Median latency |
| --- | --- | --- |
| `MATCH (i:Issue) RETURN i.id, i.title` | ~100 B per row | ~100 ms |
| `MATCH (i:Issue) RETURN i` | 512 KB for skewed rows | ~18 s |
| `MATCH (p)-[:HAS_DESCRIPTION]->(d) RETURN p, d.body` | 512 KB, but only on demand | ~150 ms |

The tests were run after importing a 72 M node / 268 M relationship dataset; ingest finished in ~5.5 minutes, so ingestion is not the bottleneck—result set size is.

Existing workarounds

  1. Project only the needed fields. Keep description in the same table, but ensure queries return explicit columns. Fetch the body with a second query when the UI actually needs it. Simple and minimally invasive, but requires two round-trips.
  2. External pointer. Move the body into Parquet/S3 and store a pointer (file, row index). LOAD FROM or application code fetches the payload separately. Keeps the main graph small, but still cannot be stitched into the same Cypher statement due to the binder restriction above.
  3. Split into body nodes. Create (:IssueBody {body}) and (:MergeRequestBody {body}) entities and link them via (:Issue)-[:HAS_DESCRIPTION]->(:IssueBody). Default queries remain lightweight (MATCH (i) RETURN i.id, …), while callers that need the text can extend their pattern (OPTIONAL MATCH (i)-[:HAS_DESCRIPTION]->(d) RETURN i, d.body). This keeps the logic in a single query and fits Kuzu’s storage semantics: body nodes still use overflow pages, but only when explicitly matched.
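
The two query shapes under option (3) can be spelled out as follows, using the node and relationship names from the bullet above:

```cypher
// Default (lightweight): metadata only; body nodes are never touched,
// so no overflow pages are streamed.
MATCH (i:Issue)
RETURN i.id, i.title;

// Opt-in: callers that need the text extend the pattern in the same
// statement. OPTIONAL MATCH keeps issues without a body node in the result.
MATCH (i:Issue {id: $id})
OPTIONAL MATCH (i)-[:HAS_DESCRIPTION]->(d:IssueBody)
RETURN i.id, i.title, d.body;
```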

Proposed Solution

Adopt option (3): introduce dedicated body nodes and relationships while leaving the original metadata nodes untouched. The plan is:

  1. Schema extension. Add new node tables (IssueBody, MergeRequestBody) keyed by id (matching the parent) and a single body STRING column. Add HAS_DESCRIPTION relationships from issue/mr to the new tables.
  2. Data loader update. Modify the Rust generator so GitLabIssue.description holds only summaries (or is removed) and the long text is written to IssueBody. Migration can be done by scanning existing records and creating the linked node.
  3. Query patterns. Update and document canonical Cypher patterns: metadata queries stay the same, while body-aware queries become MATCH (i:Issue {id: $id}) OPTIONAL MATCH (i)-[:HAS_DESCRIPTION]->(d) RETURN i, d.body.
  4. API compatibility. For a transition period we can keep the old column (populated for backfills) and mark it deprecated. Once all consumers adopt the edge-based pattern we can drop the column entirely to reduce storage.
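
Steps 1–2 could be sketched in Kuzu DDL/Cypher roughly as below. Column types are assumptions, the multi-pair REL TABLE form requires a recent Kuzu release (older versions need separate rel tables or a REL TABLE GROUP), and a real migration would batch the copy rather than run one giant CREATE:

```cypher
// Step 1: schema extension. id mirrors the parent's primary key.
CREATE NODE TABLE IssueBody(id INT64, body STRING, PRIMARY KEY (id));
CREATE NODE TABLE MergeRequestBody(id INT64, body STRING, PRIMARY KEY (id));
CREATE REL TABLE HAS_DESCRIPTION(FROM Issue TO IssueBody,
                                 FROM MergeRequest TO MergeRequestBody);

// Step 2 (migration sketch): copy existing long descriptions into body
// nodes and link them. Assumes the legacy Issue.description column is
// still populated at migration time.
MATCH (i:Issue)
WHERE i.description IS NOT NULL
CREATE (d:IssueBody {id: i.id, body: i.description})
CREATE (i)-[:HAS_DESCRIPTION]->(d);
```

The same migration pattern applies to MergeRequest/MergeRequestBody; once it has run, the deprecated column from step 4 can be dropped.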

Benefits

  • Fast defaults. Queries that return only metadata no longer stream giant payloads by accident.
  • Single-statement fetch. Clients that need the description still issue one Cypher statement; they simply opt into the extra MATCH.
  • Storage locality. Kuzu’s overflow machinery is reused for the body nodes, so we keep native compression and checkpointing.

Follow-up work

  • Spec an API for federated “body storage” should we move to Parquet/S3 later. That would require either changing the binder to allow LOAD FROM subqueries or exposing a table function that accepts parameters.
  • Investigate whether future Kuzu releases plan to support lazy column materialization or selective projection (RETURN n{.id, .title}) to mitigate similar scenarios without schema changes.
Edited by Michael Angelo Rivera