Research: Keeping the publisher domain mapping artefact for Event Data updated
Background
Event Data uses an artefact that contains a list of publisher domains. The domains are extracted automatically by resolving some DOIs. However, the artefact cannot be updated in a completely automated manner because some DOIs are registered to non-specific domains (e.g., google.com).
The artefact needs regular maintenance, and we should determine how best to do this. Some points to note:
- There should be a list of domains to exclude (google.com, youtube.com, etc.)
- It may be acceptable to have one version generated completely automatically for use by the percolator, and a more curated version for use by agents such as Twitter where there are cost implications to errors.
- How frequently the domain artefact should be updated is an open question.
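As an illustration of the exclusion list mentioned above, the following is a minimal sketch, not the existing Event Data code. The names `EXCLUDED_DOMAINS`, `is_excluded` and `filter_domains` are hypothetical, and treating subdomains of an excluded domain as excluded is an assumption that would need to be confirmed.

```python
# Hypothetical exclusion list; the real list would be curated by the team.
EXCLUDED_DOMAINS = {"google.com", "youtube.com"}


def is_excluded(domain, excluded=EXCLUDED_DOMAINS):
    """Return True if the domain is an excluded domain or a subdomain of one."""
    domain = domain.lower().rstrip(".")
    # Assumption: "docs.google.com" should be dropped along with "google.com".
    return any(domain == e or domain.endswith("." + e) for e in excluded)


def filter_domains(domains, excluded=EXCLUDED_DOMAINS):
    """Return the non-excluded domains, normalised, de-duplicated and sorted."""
    return sorted(
        {d.lower().rstrip(".") for d in domains if not is_excluded(d, excluded)}
    )
```

A fully automatic version of the artefact could apply only this filter, while a curated version could add a manual review step on top of it.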
Observed behavior
The current domain artefact hasn't been updated since 2018, and the update process is manual and cumbersome. Making regular updates could be sufficient to streamline the process.
Expected behavior
There should be a regularly updated list of prefix-to-domain mappings. This may be extracted directly by resolving DOIs, or from DOI-to-URL mappings listed in an 'identifier server'.
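A minimal sketch of how prefix-to-domain mappings could be derived from DOI-to-URL pairs is given below. It assumes the pairs have already been obtained (by resolving DOIs or from another source); the function names, the input shape and the example DOIs are hypothetical, and the exclusion check here is an exact host match only.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def doi_prefix(doi):
    """Return the part of the DOI before the first slash, e.g. '10.5555'."""
    return doi.split("/", 1)[0]


def build_prefix_domain_map(doi_url_pairs, excluded=frozenset()):
    """Group the hosts of resolved URLs by DOI prefix.

    doi_url_pairs: iterable of (doi, url) tuples (assumed input shape).
    excluded: hosts to skip, matched exactly (assumption).
    """
    mapping = defaultdict(set)
    for doi, url in doi_url_pairs:
        host = urlsplit(url).hostname  # None when the URL has no host part
        if host is None or host in excluded:
            continue
        mapping[doi_prefix(doi)].add(host)
    return {prefix: sorted(hosts) for prefix, hosts in mapping.items()}


# Usage with made-up DOIs and URLs:
pairs = [
    ("10.5555/abc", "https://www.example.org/a"),
    ("10.5555/def", "https://journals.example.org/b"),
    ("10.9999/x", "https://www.google.com/search?q=x"),
]
result = build_prefix_domain_map(pairs, excluded={"www.google.com"})
```

Keeping the resolution step separate from the grouping step means the same function could be fed from either source of DOI-to-URL mappings.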
How urgent
The domain artefact is used by a number of Event Data agents and the percolator. Without a prefix-domain mapping for a publisher, we do not detect any events for that publisher, and some publishers have asked why. While updates can be made with the current code, a better solution would reduce toil and improve the quality of data in Event Data.
Definition of ready
- Product owner: @mrittman
- Tech lead: @ppandis
- Service:: label applied
- Definition of done updated
- Acceptance testing plan: report or feedback
- Weight applied
Definition of done
- Consider any impacts to current or future architecture/infrastructure, and update specifications and documentation as needed
- Knowledge base reviewed and updated
- Acceptance criteria met
- Report on options for maintaining the artefact
- Decision on whether to split artefacts for the percolator and agents
- Plan for implementation of the preferred solution, with any relevant tickets filed
- Acceptance testing passed