Search: ArtifactDocumentBuilder assemble() flat fields + parity harness (!1236)

Part of #2616, stacked on the data layer MR (!1231). It targets that branch, so this diff is only the transform layer.

The data layer left assemble() as a throwing stub; this MR makes it real. It's a pure transform (no DB) from the fetched bundle into the search document, plus the parity test that the later transform PRs extend. It builds the flat half of the doc, 65 fields in all: scalars, single joins, the multi-value arrays, the ids, and two inscription scalars. The nested objects and ATF come later.

Most of the diff is fixtures, not code: about 870 lines of code and 3,900 of captured JSON.

Parity, not an exact match. The builder corrects a few known Logstash quirks (no ORDER BY, empties dropped on split, the unsplit collection_attribution/_comment column, witnesses joined with ': '). ParityComparator encodes each as a typed rule, so intended divergences pass but a regression in any covered field still fails. The one thing the field-by-field compare can't catch is a misaligned unzip, so the unit tests assert column alignment directly.
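To make the typed-rule idea concrete, here is a minimal sketch in Python, for illustration only. It is not the ParityComparator in this MR: the `Rule` class, the two normalisers, the `RULES` table and the `material` field name are all hypothetical. It only shows the shape: each intended divergence is a named rule applied to both sides before comparing, and fields without a rule are compared exactly.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class Rule:
    """One intended divergence: normalise both sides before comparing."""
    name: str
    normalise: Callable[[Any], Any]


def without_empties(value: Any) -> Any:
    # Hypothetical rule: ignore empty entries in multi-value fields.
    if isinstance(value, list):
        return [v for v in value if v not in ("", None)]
    return value


def unordered(value: Any) -> Any:
    # Hypothetical rule: the source query had no ORDER BY, so ignore order.
    return sorted(value) if isinstance(value, list) else value


# Hypothetical rule table; "material" is an assumed field name.
RULES = {
    "material": [
        Rule("empties dropped", without_empties),
        Rule("order-insensitive", unordered),
    ],
}


def compare(expected: dict, actual: dict, rules=RULES) -> list[str]:
    """Return the names of fields that differ after applying their rules."""
    failures = []
    for field in sorted(set(expected) | set(actual)):
        exp, act = expected.get(field), actual.get(field)
        for rule in rules.get(field, []):
            exp, act = rule.normalise(exp), rule.normalise(act)
        if exp != act:  # fields with no rule get an exact comparison
            failures.append(field)
    return failures
```

Because a rule is attached to a specific field, a divergence that is tolerated in one field still fails everywhere else.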

Fixtures. Eight ids, each with an input/ bundle (from fetchData()) and the captured expected/ ES _source: 16 files, full bundles so later PRs reuse them. The ids hit the awkward shapes: the provenience_ar graft and its null guard, all three cdli_id shapes, collection_ascii present and absent, the multi-material empty-strip, the multi-witness ': ' join.
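The input/expected pairing can be checked mechanically. The sketch below is a Python illustration, not the loader in this MR; the `<id>.json` file naming and the `load_fixture_pairs` helper are assumptions. It pairs each input bundle with its expected document and refuses to run if either side is missing a partner.

```python
import json
from pathlib import Path


def load_fixture_pairs(root: Path) -> dict[str, tuple[dict, dict]]:
    """Map fixture id -> (input bundle, expected document).

    Assumes a layout of root/input/<id>.json and root/expected/<id>.json.
    """
    inputs = {p.stem: p for p in (root / "input").glob("*.json")}
    expected = {p.stem: p for p in (root / "expected").glob("*.json")}

    # An id present on only one side would silently shrink the parity run.
    unpaired = set(inputs) ^ set(expected)
    if unpaired:
        raise ValueError(f"fixtures without a partner: {sorted(unpaired)}")

    return {
        fixture_id: (
            json.loads(inputs[fixture_id].read_text()),
            json.loads(expected[fixture_id].read_text()),
        )
        for fixture_id in sorted(inputs)
    }
```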

Not here: the nested arrays (asset/external_resource/update/publication), created/modified, primary_publication_designation, and the ATF block. All later.

Tests: ArtifactDocumentBuilderAssembleTest drives the stages on synthetic rows; ArtifactDocumentBuilderParityTest compares against the captured docs. Both DB-free, green in dev_cake_composer.
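A direct alignment assertion on synthetic rows could look like the Python sketch below. It is illustrative only: `unzip`, `assert_aligned` and the `id`/`name` keys are hypothetical, not the names used in the tests here. The idea is that index i of every parallel array must come from the same source row, which a comparison of each array on its own would not necessarily reveal.

```python
def unzip(rows: list[dict], keys: tuple[str, ...]) -> dict[str, list]:
    """Split a list of row dicts into parallel arrays, one per key."""
    return {key: [row[key] for row in rows] for key in keys}


def assert_aligned(rows: list[dict], columns: dict[str, list]) -> None:
    """Fail unless index i of every column was taken from rows[i]."""
    for key, column in columns.items():
        assert len(column) == len(rows), f"{key}: length mismatch"
        for i, row in enumerate(rows):
            assert column[i] == row[key], f"{key}[{i}] not taken from row {i}"


# Synthetic rows with distinct values, so any swap is visible.
rows = [
    {"id": 1, "name": "one"},
    {"id": 2, "name": "two"},
    {"id": 3, "name": "three"},
]
columns = unzip(rows, ("id", "name"))
assert_aligned(rows, columns)
```

Using distinct values in every synthetic row matters: with repeated values, two swapped entries could still compare equal.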

Edited Jun 16, 2026 by sung
