feat(testing): add datalake generator test tool

What does this MR do and why?

Adds the datalake-generator crate, a test-only tool that seeds ClickHouse with synthetic data for Knowledge Graph development. This is not production code -- it exists so developers can spin up a populated datalake locally without needing real GitLab data.

The generator:

  • Builds deterministic foundation entities (users, groups, projects) and writes per-project entities across staged ClickHouse inserts
  • Uses OS threads for CPU-bound row generation with Arrow RecordBatches, connected to async ClickHouse inserts via a bounded channel
  • Supports a continuous mode that generates ongoing insert/update/delete traffic after the initial seed
  • Persists state to disk so continuous mode can resume from where it left off
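For reviewers unfamiliar with the threading model, the producer/consumer shape described above can be sketched with the standard library alone. This is a minimal illustration, not the crate's actual code: `Batch` is a plain `Vec` standing in for an Arrow RecordBatch, the consumer loop stands in for the async ClickHouse insert, and `generate_batch` and `run_pipeline` are hypothetical names.

```rust
use std::sync::mpsc::sync_channel;
use std::thread;

/// Stand-in for an Arrow RecordBatch (assumption): a batch of (id, payload) rows.
type Batch = Vec<(u64, String)>;

/// Deterministically build one batch: the same inputs always yield the same rows.
fn generate_batch(worker: u64, batch_idx: u64, rows: u64) -> Batch {
    (0..rows)
        .map(|i| {
            let id = worker * 1_000_000 + batch_idx * rows + i;
            (id, format!("row-{}", id))
        })
        .collect()
}

/// Spawn `workers` OS threads that each produce `batches` batches, and drain
/// them through a bounded channel. Returns the number of rows consumed.
fn run_pipeline(workers: u64, batches: u64, rows: u64, capacity: usize) -> u64 {
    let (tx, rx) = sync_channel::<Batch>(capacity);

    let handles: Vec<_> = (0..workers)
        .map(|w| {
            let tx = tx.clone();
            thread::spawn(move || {
                for b in 0..batches {
                    // `send` blocks while the channel is full, so generation
                    // cannot run arbitrarily far ahead of the inserts.
                    tx.send(generate_batch(w, b, rows)).expect("receiver dropped");
                }
            })
        })
        .collect();

    // Drop the original sender so the receive loop ends once all workers finish.
    drop(tx);

    let mut inserted = 0u64;
    for batch in rx {
        // The real tool would hand the batch to a ClickHouse insert here.
        inserted += batch.len() as u64;
    }

    for h in handles {
        h.join().expect("worker thread panicked");
    }
    inserted
}
```

Calling `run_pipeline(4, 3, 100, 2)` in this sketch runs four workers producing three batches of 100 rows each, with at most two batches queued at a time. The bounded capacity is what keeps memory flat when inserts are slower than generation.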

Also adds datalake-generator-state/ to .gitignore since the state directory is a local artifact.
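As a rough picture of what resuming from the state directory could involve, here is a standard-library sketch. The file name, the on-disk format, and the `GeneratorState` fields are all assumptions made for illustration; the crate's real state layout is not described in this MR.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Hypothetical resume state: how many cycles ran and the next free row id.
#[derive(Debug, PartialEq, Default)]
struct GeneratorState {
    cycles_completed: u64,
    next_id: u64,
}

/// Write the state to a temp file, then rename it into place so a crash
/// mid-write does not leave a truncated state file behind.
fn save_state(dir: &Path, state: &GeneratorState) -> io::Result<()> {
    fs::create_dir_all(dir)?;
    let tmp = dir.join("state.tmp");
    fs::write(&tmp, format!("{} {}\n", state.cycles_completed, state.next_id))?;
    fs::rename(&tmp, dir.join("state.txt"))
}

/// Load the saved state, or start from the default when no state exists yet.
fn load_state(dir: &Path) -> io::Result<GeneratorState> {
    match fs::read_to_string(dir.join("state.txt")) {
        Ok(text) => {
            let mut parts = text.split_whitespace().map(|p| p.parse::<u64>());
            match (parts.next(), parts.next()) {
                (Some(Ok(cycles_completed)), Some(Ok(next_id))) => Ok(GeneratorState {
                    cycles_completed,
                    next_id,
                }),
                _ => Err(io::Error::new(
                    io::ErrorKind::InvalidData,
                    "malformed state file",
                )),
            }
        }
        Err(e) if e.kind() == io::ErrorKind::NotFound => Ok(GeneratorState::default()),
        Err(e) => Err(e),
    }
}
```

In this sketch a missing state file means a fresh start rather than an error, which matches the idea that the directory is a disposable local artifact.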

Relates to [sdlc] Create an ETL simulator for ClickHouse c... (#113)

Testing

Ran the generator against a local ClickHouse instance to validate both initial seeding and continuous mode.

Seed run (continuous disabled): Seeded 5.5M+ rows across 22 tables: 100 users, 35 groups, 350 projects, plus all dependent tables (merge requests, work items, pipelines, notes, CI builds, etc.). State was written to datalake-generator-state/ for reuse.

Full run with continuous mode enabled: After seeding, continuous mode ran 10 cycles at 5-second intervals. Each cycle produced 50 inserts, 45 updates, and 3 deletes, for totals of 500 inserts, 450 updates, and 30 deletes. Every operation targeted one of the configured entity types (MergeRequest, WorkItem, Pipeline, Note).

ClickHouse verification: Queried the target tables after the run. Row counts match the expected seed-plus-insert totals: siphon_notes has 2,625,150 rows (2,625,000 seed + 150 from 15 notes/cycle × 10 cycles), and siphon_p_ci_pipelines has 262,550 rows (262,500 seed + 50 inserted during continuous mode).
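The row-count check above is simple arithmetic, reproduced here as a small sketch. `expected_rows` is a hypothetical helper. It counts only the seed rows and the continuous-mode inserts, because that is how the reported totals are computed; how updates and deletes are represented in the tables is not stated here. The 5 pipelines per cycle is derived from the reported 50 inserts over 10 cycles, not a figure given directly.

```rust
/// Expected table size after a run: seed rows plus continuous-mode inserts.
/// Updates and deletes are deliberately ignored, mirroring the reported totals.
fn expected_rows(seed_rows: u64, inserts_per_cycle: u64, cycles: u64) -> u64 {
    seed_rows + inserts_per_cycle * cycles
}

/// Compare an observed count (for example from `SELECT count()`) with the
/// expectation and describe any mismatch.
fn check_table(
    name: &str,
    observed: u64,
    seed_rows: u64,
    inserts_per_cycle: u64,
    cycles: u64,
) -> Result<(), String> {
    let expected = expected_rows(seed_rows, inserts_per_cycle, cycles);
    if observed == expected {
        Ok(())
    } else {
        Err(format!("{}: expected {} rows, found {}", name, expected, observed))
    }
}
```

With the numbers from this run, `check_table("siphon_notes", 2_625_150, 2_625_000, 15, 10)` and `check_table("siphon_p_ci_pipelines", 262_550, 262_500, 5, 10)` both return `Ok(())`.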

Expected warnings: Vulnerability-related tables (siphon_vulnerabilities, siphon_security_scans, etc.) were skipped because they don't exist in the local schema yet. The generator logs warnings and moves on.
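The warn-and-continue behavior can be pictured with the sketch below. It assumes the set of existing tables has already been fetched; how the generator actually discovers them is not covered in this description. `partition_tables` is a hypothetical name, and the warning text is illustrative rather than the tool's real log output.

```rust
use std::collections::HashSet;

/// Split the tables the generator wants to write into those present in the
/// local schema and those that must be skipped, warning about each skip.
fn partition_tables<'a>(
    wanted: &[&'a str],
    existing: &HashSet<&str>,
) -> (Vec<&'a str>, Vec<&'a str>) {
    let mut present = Vec::new();
    let mut skipped = Vec::new();
    for &table in wanted {
        if existing.contains(table) {
            present.push(table);
        } else {
            // Illustrative message only; generation continues with other tables.
            eprintln!("warning: table {} not found in schema, skipping", table);
            skipped.push(table);
        }
    }
    (present, skipped)
}
```

Returning the skipped list, instead of only logging it, makes it easy to assert in a test that only the expected vulnerability-related tables were skipped.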

Performance Analysis

  • This merge request does not introduce any performance regression. If a performance regression is expected, explain why.
