Add ArtifactDocumentBuilder skeleton and the diacritics/reference helpers

Part of #2616.

First piece of the PHP document builder that will replace the Ruby filter Logstash currently uses to build artifact search documents (dev/data/logstash/resources/artifacts.rb). Moving it into PHP lets us build documents one artifact at a time, so indexing can be incremental instead of a nightly full rebuild. Incremental indexing comes later; this MR sets up the structure and ports the two self-contained helpers, so the heavier fetch and transform work lands on something that's already tested.

What's here

  • DocumentBuilderInterface: the contract the entity builders will share (build(array $ids), indexName()). Builders only produce documents; pushing them to ES/OpenSearch stays the indexer's job.
  • ArtifactDocumentBuilder: the skeleton. The constructor takes the DB connection, build() returns early on an empty id list and otherwise throws until the fetch layer exists, and assemble() is the seam the field-by-field parity test will use against the current Logstash output. makeExactReference() already lives here since it's pure string logic (artifacts.rb:14-22).
  • DiacriticsHelper: removeDiacritics() (NFD, strip the combining-mark ranges, then NFC; artifacts.rb:1-12) and graftProvenienceAr() (the (mod. …) Arabic substitution, artifacts.rb:36-40).
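
The NFD / strip / NFC sequence described for removeDiacritics() can be sketched roughly as below. This is a minimal illustration rather than the code in this MR: the function name removeDiacriticsSketch is hypothetical, it assumes the intl extension (for Normalizer) is available, and it strips only the basic Combining Diacritical Marks block (U+0300 to U+036F), whereas the real helper follows the ranges listed in artifacts.rb:1-12.

```php
<?php
declare(strict_types=1);

// Sketch only. Assumption: a single combining-mark range (U+0300-U+036F);
// the real helper uses the ranges from artifacts.rb:1-12.
function removeDiacriticsSketch(string $text): string
{
    // 1. NFD: split precomposed letters into base letter + combining marks.
    $decomposed = Normalizer::normalize($text, Normalizer::FORM_D);
    if ($decomposed === false) {
        return $text; // not valid UTF-8: leave the input untouched
    }

    // 2. Drop the combining marks, keeping the base letters.
    $stripped = preg_replace('/[\x{0300}-\x{036F}]/u', '', $decomposed);
    if ($stripped === null) {
        return $text;
    }

    // 3. NFC: recompose whatever is left into canonical composed form.
    $recomposed = Normalizer::normalize($stripped, Normalizer::FORM_C);

    return $recomposed === false ? $stripped : $recomposed;
}
```

Letters such as š, ṣ, ṭ and ḫ decompose into a base letter plus a mark in that block, so they come back as plain s, t and h. A letter like ø has no canonical decomposition, so NFD leaves it as a single code point and it passes through unchanged.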

Two notes on the port:

  • removeDiacritics is kept faithful to the Ruby, so letters with no canonical decomposition (ø, æ, ß, …) pass through unchanged. That means it does not fix Schøyen (#2018) on its own. That fix belongs in the OpenSearch analyzer when we redo the mapping, not in the builder.
  • graftProvenienceAr uses preg_replace_callback rather than preg_replace, so a $ or \ in an Arabic name can't be read as a backreference. Nothing in the data hits this today; it's a cheap guard.
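
To illustrate the backreference point, here is a small sketch of the difference between the two functions. The pattern, the subject string and the replacement name are all made up for the demonstration; the actual (mod. …) substitution is the one ported from artifacts.rb:36-40.

```php
<?php
declare(strict_types=1);

// Hypothetical pattern and inputs, chosen only to show the escaping issue.
$pattern = '/\(mod\. ([^)]*)\)/u';
$subject = 'Girsu (mod. Tello)';

// Stand-in for a name that happens to contain "$1" and "\0".
$name = 'cost $1 \\0';

// preg_replace parses the replacement string, so "$1" and "\0" are
// expanded to the first capture group and the whole match.
$unsafe = preg_replace($pattern, '(mod. ' . $name . ')', $subject);

// preg_replace_callback inserts the callback's return value literally.
$safe = preg_replace_callback(
    $pattern,
    static fn (array $match): string => '(mod. ' . $name . ')',
    $subject
);

echo $unsafe, PHP_EOL; // Girsu (mod. cost Tello (mod. Tello))
echo $safe, PHP_EOL;   // Girsu (mod. cost $1 \0)
```

The callback form costs one closure and avoids having to escape the replacement by hand.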

Not in this MR (follow-ups)

  • Batched per-relation fetch plus the recursive-CTE hierarchy lookups
  • Document assembly / transform stages
  • The ATF parser

build() and assemble() throw plain "not implemented yet" errors until then.
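
For reviewers who want the shape at a glance, the stub behaviour amounts to something like the following. This is a sketch, not the code in the MR: the return types, the LogicException class, the 'artifacts' index name and the ArtifactDocumentBuilderSketch class name are assumptions, and the constructor's DB connection is left out to keep the example self-contained.

```php
<?php
declare(strict_types=1);

interface DocumentBuilderInterface
{
    /**
     * @param array $ids entity ids to build documents for
     * @return array built documents (return type assumed for this sketch)
     */
    public function build(array $ids): array;

    public function indexName(): string;
}

// Hypothetical stand-in for ArtifactDocumentBuilder.
class ArtifactDocumentBuilderSketch implements DocumentBuilderInterface
{
    public function build(array $ids): array
    {
        // Nothing to build: return before touching the (missing) fetch layer.
        if ($ids === []) {
            return [];
        }

        // Assumed exception class; the MR only says "plain" errors.
        throw new LogicException('not implemented yet');
    }

    public function indexName(): string
    {
        return 'artifacts'; // assumed index name
    }
}
```

Calling build([]) returns an empty array, while any non-empty id list throws until the fetch layer lands.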

Testing

22 unit tests, all pure (no DB or search engine), so they need no service container to run. They cover the diacritics ranges (including the Assyriological glyphs that actually show up in the corpus), the make_exact_reference punctuation rule, the (mod. …) graft edge cases, and the stub behaviour. Protected methods go through a small Testable subclass, the same pattern as the existing ElasticSearchQueryTest. Run with composer test inside dev_cake_composer.
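
The Testable subclass pattern mentioned above looks roughly like this. The class and method names are hypothetical and the method body is a placeholder; the point is only that PHP lets an overriding method widen visibility from protected to public, so a test-only subclass can expose a protected method without reflection.

```php
<?php
declare(strict_types=1);

// Hypothetical stand-in for a class with a protected helper.
class BuilderUnderTest
{
    protected function normalizeKey(string $value): string
    {
        return strtolower(trim($value)); // placeholder logic
    }
}

// Test-only subclass: re-declares the method as public and delegates.
class TestableBuilder extends BuilderUnderTest
{
    public function normalizeKey(string $value): string
    {
        return parent::normalizeKey($value);
    }
}

$builder = new TestableBuilder();
echo $builder->normalizeKey('  ABC '), PHP_EOL; // abc
```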
