Research: Keeping the publisher domain mapping artefact for Event Data updated
# Background

Event Data uses an artefact which contains a list of publisher domains. These are automatically extracted by resolving some DOIs. However, the artefact cannot be updated in a completely automated manner because some DOIs are registered to non-specific domains (e.g., google.com). The artefact needs regular maintenance, and we should determine how best to do this.

Some points to note:

1. There should be a list of domains to exclude (google.com, youtube.com, etc.).
1. It may be acceptable to have one version generated completely automatically for use by the percolator, and a more curated version for use by agents such as Twitter, where errors have cost implications.
1. How frequently the domain list should be updated is an open question.

# Observed behavior

The current domain list hasn't been updated since 2018, and the update process is manual and cumbersome. Simply making regular updates may be enough to streamline the process.

# Expected behavior

There should be a regularly updated list of prefix-to-domain mappings. This may be extracted directly by resolving DOIs, or from DOI-to-URL mappings listed in an 'identifier server'.

# How urgent

The domain list is used by a number of Event Data agents and the percolator. Without a prefix-domain mapping we do not detect any events for a publisher, and some publishers have enquired about why. While updates can be made with the current code, a better solution would reduce toil and improve the quality of data in Event Data.

[comment]: # (No need to update the Definition of ready when filing issues, but feel free to have a go if you're familiar with the territory.)

# Definition of ready

- [x] Product owner: @mrittman
- [x] Tech lead: @ppandis
- [x] Service:: label applied
- [x] Definition of done updated
- [x] Acceptance testing plan: report or feedback
- [x] Weight applied

[comment]: # (Feel free to leave this as is, or suggest changes. We'll update these during Backlog Refinement, prior to bringing this into a sprint.)

# Definition of done

- [ ] Consider any impacts to current or future architecture/infrastructure, and update specifications and documentation as needed
- [ ] Knowledge base reviewed and updated
- [ ] Acceptance criteria met
- [ ] Report on options for maintaining the artefact
- [ ] Decision on whether to split artefacts for the percolator and agents
- [ ] Plan for implementation of the preferred solution, with any relevant tickets filed
- [ ] Acceptance testing passed

# Notes

[comment]: # (By default all issues need to be labeled Planning::New, only remove if you know what you're doing)

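As a starting point for the options report, here is a minimal sketch of how a prefix-to-domain mapping with an exclusion list might be built. It is not the current Event Data code. It assumes the DOI-to-URL pairs have already been obtained (by resolving DOIs or from an identifier server), and the names `EXCLUDED_DOMAINS`, `doi_prefix`, `is_excluded` and `build_prefix_domain_map` are hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlsplit

# Hypothetical exclusion list; the real one would need curating.
EXCLUDED_DOMAINS = {"google.com", "youtube.com"}


def doi_prefix(doi):
    """Return the part of the DOI before the first slash, e.g. '10.5555'."""
    return doi.split("/", 1)[0]


def is_excluded(host, excluded=EXCLUDED_DOMAINS):
    """True if host is an excluded domain or a subdomain of one."""
    return any(host == d or host.endswith("." + d) for d in excluded)


def build_prefix_domain_map(doi_url_pairs, excluded=EXCLUDED_DOMAINS):
    """Group resolved hostnames by DOI prefix, skipping excluded domains.

    doi_url_pairs: iterable of (doi, resolved_url) tuples.
    Returns a dict mapping each prefix to a sorted list of hostnames.
    """
    mapping = defaultdict(set)
    for doi, url in doi_url_pairs:
        host = (urlsplit(url).hostname or "").lower()
        if not host or is_excluded(host, excluded):
            continue
        mapping[doi_prefix(doi)].add(host)
    return {prefix: sorted(hosts) for prefix, hosts in mapping.items()}


# Made-up sample data for illustration only.
sample_pairs = [
    ("10.5555/12345678", "https://www.example.com/article/1"),
    ("10.5555/abcdef", "https://www.youtube.com/watch?v=x"),
    ("10.9999/xyz", "https://journals.example.org/xyz"),
]
sample_map = build_prefix_domain_map(sample_pairs)
```

With the sample data, the `youtube.com` URL is dropped and each prefix keeps only its publisher hostname. A curated variant for cost-sensitive agents could reuse the same function with a stricter `excluded` set.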
issue