Search: ArtifactDocumentBuilder data layer (relation fetchers + CTEs) (!1231)

Part of #2616. Builds on the foundation merged in !1230.

This MR adds the data (fetch) layer of the artifact document builder. It assembles the raw row bundle that a single artifact's search document needs, fetched in batched queries (one per relation, never one per artifact), and returns it unshaped. It reproduces the Logstash SQL pipeline (dev/data/logstash/resources/artifacts.sql), but scoped per artifact and without GROUP_CONCAT, so indexing can run incrementally rather than as a full rebuild. The transforms that turn these rows into the final document are part of the next MR; nothing here transforms the rows.
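The batching described above can be sketched as follows. This is a minimal illustration in Python with sqlite3, not the MR's PHP code; the function name `fetch_relation`, the table names `artifacts_materials` and `materials`, and their columns are assumptions made for the example. It shows one query covering the whole batch, with rows grouped in application code instead of GROUP_CONCAT.

```python
import sqlite3
from collections import defaultdict


def fetch_relation(conn, artifact_ids):
    """Fetch one relation for a whole batch of artifacts in a single query.

    Hypothetical helper: table and column names are illustrative only.
    """
    if not artifact_ids:
        return {}  # an empty IN () list is invalid SQL, so short-circuit
    placeholders = ", ".join("?" for _ in artifact_ids)
    sql = (
        "SELECT am.artifact_id, am.id, m.material "
        "FROM artifacts_materials am "
        "JOIN materials m ON m.id = am.material_id "
        f"WHERE am.artifact_id IN ({placeholders}) "
        "ORDER BY am.artifact_id, am.id"  # deterministic order by junction id
    )
    grouped = defaultdict(list)
    for artifact_id, junction_id, material in conn.execute(sql, list(artifact_ids)):
        # Rows stay as rows: no concatenation, no shaping.
        grouped[artifact_id].append({"junction_id": junction_id, "material": material})
    return dict(grouped)


conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE materials (id INTEGER PRIMARY KEY, material TEXT);
    CREATE TABLE artifacts_materials (
        id INTEGER PRIMARY KEY, artifact_id INTEGER, material_id INTEGER
    );
    INSERT INTO materials VALUES (1, 'clay'), (2, 'stone');
    INSERT INTO artifacts_materials VALUES (10, 100, 2), (11, 100, 1), (12, 200, 1);
    """
)

rows_by_artifact = fetch_relation(conn, [100, 200, 300])
```

Artifact 300 has no rows for this relation, so it is simply absent from the result; how the real builder represents an empty relation is part of its own row-shape contract and is not assumed here.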

What's here

preloadHierarchies() and the four recursive CTEs (artifact_types, materials, genres, languages), cached once as id-to-path maps and resolved into the bundle. These are the only recursive SQL; the relation fetchers are batched WHERE id IN (...) selects.
Fetchers (12): the core row (scalars and single joins), then 11 relations: collections (+ license), materials, genres, languages, external resources, composites, seals, the latest inscription, assets (+ annotation values and authors), publications (+ authors/editors by sequence, COALESCE publisher), and approved update events (linked via the UNION of artifacts_updates and inscriptions).
build() wired to the full structure: an absent or retired artifact returns a delete signal; everything else routes to assemble(), which still throws until PR3 (the next MR).
A documented row-shape contract on fetchData(), so PR3 is built against a fixed shape.
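As a rough illustration of the id-to-path maps mentioned in the list above, the sketch below runs a recursive CTE once and caches the result as a dictionary. It is written in Python with sqlite3 rather than the MR's PHP; the function name `preload_paths` and the columns `name` and `parent_id` are assumptions, and SQLite's `||` operator stands in for the string concatenation a MySQL query would express with CONCAT.

```python
import sqlite3

# Walk the hierarchy from the roots down, building "A > B" paths as we go.
# Column names (name, parent_id) are hypothetical.
PATH_CTE = """
WITH RECURSIVE tree (id, path) AS (
    SELECT id, name FROM genres WHERE parent_id IS NULL
    UNION ALL
    SELECT g.id, tree.path || ' > ' || g.name
    FROM genres g
    JOIN tree ON g.parent_id = tree.id
)
SELECT id, path FROM tree
"""


def preload_paths(conn):
    """Run the recursive query once and return an id -> path map."""
    return dict(conn.execute(PATH_CTE))


conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE genres (id INTEGER PRIMARY KEY, name TEXT, parent_id INTEGER);
    INSERT INTO genres VALUES
        (1, 'Literary', NULL),
        (2, 'Hymn', 1),
        (3, 'Administrative', NULL);
    """
)

genre_paths = preload_paths(conn)
```

With the map cached, resolving a relation row's genre id to its path is a dictionary lookup, so the recursive query does not need to be repeated for each artifact.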

Notes on the port

Every pipeline filter is re-applied (inscriptions.is_latest = 1, update_events.status = 'approved', entities_publications.table_name = 'artifacts'). Skipping any would pull stale inscription revisions, unapproved updates, or publication rows from other entity types.
The source SQL has no ORDER BY, so the fetchers add deterministic ordering (junction id, publication authors/editors by sequence). Some multi-value arrays therefore return in a different order than the current ES document. This is an intended parity correction, not a regression.
chemical_data uses IF(EXISTS(...)) instead of the pipeline's join. That also removes a latent duplication: one artifact has six chemical_data rows, so the Logstash LEFT JOIN emits six duplicate documents (ES masks it by id-dedup).
The constructor is typed to Cake\Database\Connection, since execute() is defined on the concrete class, not ConnectionInterface. This is the same annotation ArtifactsTable already uses for its raw queries.
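The chemical_data note above can be reproduced on a toy schema. The sketch below uses Python with sqlite3 and invented table contents; it is not the pipeline's SQL. SQLite has no IF() function, so a bare EXISTS(...) stands in for the IF(EXISTS(...)) expression. It compares the row count of a LEFT JOIN with that of an EXISTS subquery when one artifact has six chemical_data rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE artifacts (id INTEGER PRIMARY KEY);
    CREATE TABLE chemical_data (id INTEGER PRIMARY KEY, artifact_id INTEGER);
    INSERT INTO artifacts VALUES (1), (2);
    -- Artifact 1 has six chemical_data rows; artifact 2 has none.
    INSERT INTO chemical_data (artifact_id) VALUES (1), (1), (1), (1), (1), (1);
    """
)

# Join form: one output row per matching chemical_data row.
joined = conn.execute(
    """
    SELECT a.id, c.id IS NOT NULL
    FROM artifacts a
    LEFT JOIN chemical_data c ON c.artifact_id = a.id
    ORDER BY a.id
    """
).fetchall()

# EXISTS form: exactly one output row per artifact, carrying a 0/1 flag.
flagged = conn.execute(
    """
    SELECT a.id, EXISTS(SELECT 1 FROM chemical_data c WHERE c.artifact_id = a.id)
    FROM artifacts a
    ORDER BY a.id
    """
).fetchall()
```

The join yields seven rows for two artifacts, while the EXISTS form yields two, which is why the flag can be computed without relying on downstream id-dedup.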

Not in this MR

Document assembly and the transform stages (make_exact_reference, the provenience_ar graft, ascii variants, nesting into objects, the creator-first author combine)
The ATF parser
Output shaping for []/null/key-absent fields

Testing

A @group db test runs against the imported cdli_db on known artifact ids: read-only, no fixtures, since the test datasource points at the real database. It covers each relation's shape, the CTE paths resolving to "A > B", author/editor sequence order, the re-applied filters, the boolean casts, and the delete branches.

Running the @group db test requires the cdli_db import in the dev database; with that, composer test in dev_cake_composer runs the full suite green. Without the import, skip it with vendor/bin/phpunit --exclude-group db (the unit tests need no database).
