Add ArtifactDocumentBuilder skeleton and the diacritics/reference helpers
Part of #2616.
First piece of the PHP document builder that will replace the Ruby filter Logstash currently uses to build artifact search documents (dev/data/logstash/resources/artifacts.rb). Moving it into PHP lets us build documents one artifact at a time, so indexing can be incremental instead of a nightly full rebuild. That comes later; this MR sets up the structure and ports the two self-contained helpers, so the heavier fetch and transform work lands on something that's already tested.
What's here
- DocumentBuilderInterface: the contract the entity builders will share (build(array $ids), indexName()). Builders only produce documents; pushing them to ES/OpenSearch stays the indexer's job.
- ArtifactDocumentBuilder: the skeleton. The constructor takes the DB connection, build() returns early on an empty id list and otherwise throws until the fetch layer exists, and assemble() is the seam the field-by-field parity test will use against the current Logstash output. makeExactReference() already lives here since it's pure string logic (artifacts.rb:14-22).
- DiacriticsHelper: removeDiacritics() (NFD, strip the combining-mark ranges, then NFC; artifacts.rb:1-12) and graftProvenienceAr() (the (mod. …) Arabic substitution, artifacts.rb:36-40).
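For orientation, the shape described above could look roughly like the sketch below. It is not the code in this MR: the return types, the exception class, the `artifacts` index name, the loosely typed connection argument, and the `Sketch` suffix are all assumptions made for illustration.

```php
<?php
declare(strict_types=1);

// Sketch only. Return types and names below are assumptions, not the MR's code.
interface DocumentBuilderInterface
{
    /**
     * @param int[] $ids artifact ids to build documents for
     * @return array<int, array<string, mixed>> one document per id
     */
    public function build(array $ids): array;

    public function indexName(): string;
}

class ArtifactDocumentBuilderSketch implements DocumentBuilderInterface
{
    /** @var object|null stand-in for the DB connection (real type not shown here) */
    protected $connection;

    public function __construct(?object $connection = null)
    {
        $this->connection = $connection;
    }

    public function build(array $ids): array
    {
        if ($ids === []) {
            return []; // empty id list: nothing to fetch, return early
        }
        // The fetch layer does not exist yet, so anything else throws.
        throw new \RuntimeException('not implemented yet');
    }

    public function indexName(): string
    {
        return 'artifacts'; // hypothetical index name
    }

    /** Seam for the parity test; stubbed until the transform stages land. */
    protected function assemble(array $row): array
    {
        throw new \RuntimeException('not implemented yet');
    }
}
```

Keeping the builder free of any indexing call means the same class can later serve both a full rebuild and an incremental per-artifact update.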
Two notes on the port:
- removeDiacritics is kept faithful to the Ruby, so letters with no canonical decomposition (ø, æ, ß, …) pass through unchanged. That means it does not fix Schøyen (#2018) on its own. That belongs at the OpenSearch analyzer when we redo the mapping, not in the builder.
- graftProvenienceAr uses preg_replace_callback rather than preg_replace, so a $ or \ in an Arabic name can't be read as a backreference. Nothing in the data hits this today; it's a cheap guard.
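Both notes can be illustrated with a small standalone sketch. It assumes the `intl` extension (for `Normalizer`), a single combining-mark range U+0300 to U+036F (the ranges in artifacts.rb may be wider), and a made-up `(mod. …)` pattern; the function names are hypothetical and are not the helper's real signatures.

```php
<?php
declare(strict_types=1);

// Sketch only: decompose (NFD), strip combining marks, recompose (NFC).
// The single range used here is an assumption about what the port strips.
function removeDiacriticsSketch(string $text): string
{
    $decomposed = \Normalizer::normalize($text, \Normalizer::FORM_D);
    if ($decomposed === false) {
        return $text; // invalid UTF-8: leave the input untouched
    }
    $stripped = preg_replace('/[\x{0300}-\x{036F}]/u', '', $decomposed) ?? $decomposed;
    $recomposed = \Normalizer::normalize($stripped, \Normalizer::FORM_C);

    return $recomposed === false ? $stripped : $recomposed;
}

// Hypothetical pattern for the "(mod. ...)" part of a provenience string.
const MOD_PATTERN_SKETCH = '/\(mod\. ([^)]*)\)/u';

// Naive version: the name is spliced into the replacement string, so "$1"
// or "\0" inside it is interpreted as a backreference.
function graftNaiveSketch(string $provenience, string $name): string
{
    return preg_replace(MOD_PATTERN_SKETCH, '(mod. ' . $name . ')', $provenience) ?? $provenience;
}

// Callback version: the closure's return value is inserted literally.
function graftSafeSketch(string $provenience, string $name): string
{
    $result = preg_replace_callback(
        MOD_PATTERN_SKETCH,
        static function (array $match) use ($name): string {
            return '(mod. ' . $name . ')';
        },
        $provenience
    );

    return $result ?? $provenience;
}

echo removeDiacriticsSketch('Šamaš'), "\n";   // š decomposes, so the caron is stripped
echo removeDiacriticsSketch('Schøyen'), "\n"; // ø has no canonical decomposition
echo graftNaiveSketch('Nippur (mod. Nuffar)', 'X$1'), "\n"; // "$1" expands to the old name
echo graftSafeSketch('Nippur (mod. Nuffar)', 'X$1'), "\n";  // "$1" is kept as literal text
```

The callback form costs one closure per call and removes the need to escape the replacement by hand.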
Not in this MR (follow-ups)
- Batched per-relation fetch plus the recursive-CTE hierarchy lookups
- Document assembly / transform stages
- The ATF parser
build() and assemble() throw plain "not implemented yet" errors until then.
Testing
22 unit tests, all pure (no DB or search engine), so they need no service container to run. They cover the diacritics ranges (including the Assyriological glyphs that actually show up in the corpus), the make_exact_reference punctuation rule, the (mod. …) graft edge cases, and the stub behaviour. Protected methods go through a small Testable subclass, the same pattern as the existing ElasticSearchQueryTest. Run with composer test inside dev_cake_composer.
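The Testable-subclass pattern mentioned above can be sketched as follows. The class names, the protected method, and its trimming logic are placeholders chosen for illustration; they do not reproduce the real builder or ElasticSearchQueryTest.

```php
<?php
declare(strict_types=1);

// Placeholder class standing in for a builder with a protected helper.
class BuilderUnderTestSketch
{
    protected function tidyReference(string $reference): string
    {
        return trim($reference); // placeholder logic, not the real punctuation rule
    }
}

// Test-only subclass: exposes the protected method through a public wrapper
// so a unit test can call it without reflection.
class TestableBuilderUnderTestSketch extends BuilderUnderTestSketch
{
    public function callTidyReference(string $reference): string
    {
        return $this->tidyReference($reference);
    }
}

$testable = new TestableBuilderUnderTestSketch();
var_dump($testable->callTidyReference('  abc  ') === 'abc');
```

The subclass lives only in the test tree, so the production class keeps its protected visibility.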