Search: ArtifactDocumentBuilder data layer (relation fetchers + CTEs)
Part of #2616. Builds on the foundation merged in !1230.
This MR adds the data (fetch) layer of the artifact document builder. It assembles the raw row bundle that a single artifact's search document needs, using batched queries (one per relation, never one per artifact), and returns it unshaped. It reproduces the Logstash SQL pipeline (dev/data/logstash/resources/artifacts.sql), but per artifact and without GROUP_CONCAT, so indexing can run incrementally rather than as a full rebuild. The transforms that turn these rows into the final document belong to the next MR; nothing here transforms.
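To illustrate the "one query per relation" idea, here is a minimal sketch, not the code in this MR. It uses plain PDO against in-memory SQLite instead of the Cake connection, and the table, column, and function names (artifacts_materials, fetchMaterialRows) are assumptions made for the example. It shows one relation fetched for several artifact ids in a single WHERE ... IN (...) select, with the raw rows grouped by artifact id in PHP rather than collapsed by GROUP_CONCAT.

```php
<?php
declare(strict_types=1);

// Hypothetical junction table in in-memory SQLite; the real schema differs.
$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$pdo->exec('CREATE TABLE artifacts_materials (id INTEGER PRIMARY KEY, artifact_id INT, material_id INT)');
$pdo->exec('INSERT INTO artifacts_materials (id, artifact_id, material_id)
            VALUES (1, 10, 3), (2, 10, 5), (3, 11, 3), (4, 12, 7)');

/**
 * Fetch one relation for many artifacts in a single query and return
 * the raw rows grouped by artifact id (no GROUP_CONCAT, no shaping).
 */
function fetchMaterialRows(PDO $pdo, array $artifactIds): array
{
    if ($artifactIds === []) {
        return [];
    }
    // One "?" placeholder per requested id.
    $placeholders = implode(', ', array_fill(0, count($artifactIds), '?'));
    $stmt = $pdo->prepare(
        "SELECT id, artifact_id, material_id
           FROM artifacts_materials
          WHERE artifact_id IN ($placeholders)
          ORDER BY id"
    );
    $stmt->execute(array_values($artifactIds));

    // Every requested id gets a key, so "no rows" is an empty list, not a missing key.
    $byArtifact = array_fill_keys($artifactIds, []);
    foreach ($stmt->fetchAll(PDO::FETCH_ASSOC) as $row) {
        $byArtifact[(int)$row['artifact_id']][] = $row;
    }
    return $byArtifact;
}

$rows = fetchMaterialRows($pdo, [10, 11, 99]);
```

In this sketch, an artifact with no rows for the relation (id 99) still appears in the result with an empty list, which keeps the multi-value relation as separate rows instead of a concatenated string.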
What's here
- preloadHierarchies() and the four recursive CTEs (artifact_types, materials, genres, languages), cached once as id-to-path maps and resolved into the bundle. These are the only recursive SQL; the relation fetchers are batched WHERE id IN (...) selects.
- Fetchers (12): the core row (scalars and single joins), then relations (11): collections (+ license), materials, genres, languages, external resources, composites, seals, the latest inscription, assets (+ annotation values and authors), publications (+ authors/editors by sequence, COALESCE publisher), and approved update events (linked via the UNION of artifacts_updates and inscriptions).
- build() wired to the full structure: an absent or retired artifact returns a delete signal; everything else routes to assemble(), which still throws until PR3 (the next one).
- A documented row-shape contract on fetchData(), so PR3 is built against a fixed shape.
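As a rough illustration of how a hierarchy table can be turned into an id-to-path map with a recursive CTE, the sketch below builds "A > B" paths for a tiny materials tree. It is an assumption-laden stand-in: it runs on in-memory SQLite through PDO, the column names (parent_id, material) and the function loadMaterialPaths are hypothetical, and SQLite's || string concatenation would be CONCAT() on MySQL/MariaDB.

```php
<?php
declare(strict_types=1);

// Hypothetical hierarchy table; the real columns may be named differently.
$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$pdo->exec('CREATE TABLE materials (id INTEGER PRIMARY KEY, parent_id INT NULL, material TEXT)');
$pdo->exec("INSERT INTO materials (id, parent_id, material)
            VALUES (1, NULL, 'stone'), (2, 1, 'limestone'), (3, NULL, 'clay')");

/**
 * Walk the tree from the roots down and return [id => "Root > Child"].
 * Intended to be called once and reused for every artifact.
 */
function loadMaterialPaths(PDO $pdo): array
{
    $sql = "
        WITH RECURSIVE tree (id, path) AS (
            -- anchor: root nodes carry their own name as the path
            SELECT id, material FROM materials WHERE parent_id IS NULL
            UNION ALL
            -- step: append each child's name to its parent's path
            SELECT m.id, tree.path || ' > ' || m.material
              FROM materials m
              JOIN tree ON m.parent_id = tree.id
        )
        SELECT id, path FROM tree";

    $paths = [];
    foreach ($pdo->query($sql) as $row) {
        $paths[(int)$row['id']] = $row['path'];
    }
    return $paths;
}

$paths = loadMaterialPaths($pdo);
```

With a map like this held in memory, resolving a relation row's material id to its full path is an array lookup, so no recursive query runs per artifact.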
Notes on the port
- Every pipeline filter is re-applied (inscriptions.is_latest = 1, update_events.status = 'approved', entities_publications.table_name = 'artifacts'). Skipping any would pull stale inscription revisions, unapproved updates, or publication rows from other entity types.
- The source SQL has no ORDER BY, so the fetchers add deterministic ordering (junction id, publication authors/editors by sequence). Some multi-value arrays therefore return in a different order than the current ES document. This is an intended parity correction, not a regression.
- chemical_data uses IF(EXISTS(...)) instead of the pipeline's join. That also removes a latent duplication: one artifact has six chemical_data rows, so the Logstash LEFT JOIN emits six duplicate documents (ES masks it by id-dedup).
- The constructor is typed to Cake\Database\Connection, since execute() is defined on the concrete class, not ConnectionInterface. This is the same annotation ArtifactsTable already uses for its raw queries.
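The chemical_data point can be reproduced in miniature. The sketch below is illustrative only: it uses in-memory SQLite through PDO, a two-table schema invented for the example (including the artifact_id column), and a bare EXISTS where the MR's MySQL query uses IF(EXISTS(...)). It compares the row count of a LEFT JOIN with that of an EXISTS subquery when one artifact has six chemical_data rows.

```php
<?php
declare(strict_types=1);

// Hypothetical minimal schema; only the row multiplicity matters here.
$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$pdo->exec('CREATE TABLE artifacts (id INTEGER PRIMARY KEY)');
$pdo->exec('CREATE TABLE chemical_data (id INTEGER PRIMARY KEY, artifact_id INT)');
$pdo->exec('INSERT INTO artifacts (id) VALUES (1), (2)');
// Artifact 1 has six chemical_data rows, artifact 2 has none.
$pdo->exec('INSERT INTO chemical_data (artifact_id) VALUES (1), (1), (1), (1), (1), (1)');

// Join form: one output row per matching chemical_data row.
$joined = $pdo->query(
    'SELECT a.id, c.id IS NOT NULL AS has_chemical_data
       FROM artifacts a
       LEFT JOIN chemical_data c ON c.artifact_id = a.id
      ORDER BY a.id'
)->fetchAll(PDO::FETCH_ASSOC);

// EXISTS form: exactly one output row per artifact.
$exists = $pdo->query(
    'SELECT a.id,
            EXISTS (SELECT 1 FROM chemical_data c WHERE c.artifact_id = a.id) AS has_chemical_data
       FROM artifacts a
      ORDER BY a.id'
)->fetchAll(PDO::FETCH_ASSOC);

printf("join rows: %d, exists rows: %d\n", count($joined), count($exists));
```

Here the join form returns seven rows for two artifacts (six for artifact 1, one for artifact 2), while the EXISTS form returns two, carrying the same boolean per artifact.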
Not in this MR
- Document assembly and the transform stages (make_exact_reference, the provenience_ar graft, ascii variants, nesting into objects, the creator-first author combine)
- The ATF parser
- Output shaping for []/null/key-absent fields
Testing
A @group db test runs against the imported cdli_db on known artifact ids: read-only, no fixtures, since the test datasource points at the real database. It covers each relation's shape, the CTE paths resolving to "A > B", author/editor sequence order, the re-applied filters, the boolean casts, and the delete branches.
Running the @group db test requires the cdli_db import in the dev database; with that, composer test in dev_cake_composer runs the full suite green. Without the import, skip it with vendor/bin/phpunit --exclude-group db (the unit tests need no database).