Cayenne should not remove characters from DOIs during ingestion
Background
Cayenne removes some characters, including non-printable and Unicode characters, from DOIs during ingestion. This happens when the XML is transformed into an ItemTree. The scale of the problem is unknown.
The change was introduced in rest_api@923086f5 and affected both the old and the new REST API.
Observed behavior
Example: DOI 10.2741/Ortéga was transformed into DOI 10.2741/Ortga during ingestion: https://api.crossref.org/works/10.2741/Ortga
So we have two unwanted effects: 1) some DOIs are not accessible through the REST API or the JSON snapshot (10.2741/Ortéga), and 2) the REST API and the JSON snapshot contain DOIs that do not exist (10.2741/Ortga).
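The effect can be reproduced with a minimal sketch. The function below is a hypothetical stand-in, not the actual code from rest_api@923086f5; it assumes the lossy step behaves like "keep printable ASCII only", which is consistent with the example above but has not been confirmed against Cayenne.

```python
def strip_like_observed(doi: str) -> str:
    """Hypothetical stand-in for the lossy step (assumption: only
    printable ASCII characters survive)."""
    return "".join(ch for ch in doi if ch.isascii() and ch.isprintable())


# "\u00e9" is the precomposed "é" from the example DOI.
registered = "10.2741/Ort\u00e9ga"
indexed = strip_like_observed(registered)

print(indexed)                # 10.2741/Ortga
print(indexed == registered)  # False: the indexed DOI is not the registered one
```

A plain ASCII DOI passes through unchanged, so only DOIs containing the removed characters would be affected under this assumption.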
Expected behavior
Cayenne should not modify DOIs in any way. When a DOI appears in the bucket, it has been successfully registered, so Cayenne should ingest and index it as is.
See also @gbilder's comment: #1231 (comment 702197157)
How urgent
Important also in the context of the Manifold.
Definition of ready
- Product owner: @ppolischuk1
- Tech lead: @dtkaczyk
- Service:: or C:: label applied
- Definition of done updated
- Acceptance testing plan:
- Weight applied
Definition of done
- Unit tests identified, implemented, and passing
- Code reviewed
- Available for acceptance testing via a staging URL, or otherwise
- Consider any impacts to current or future architecture/infrastructure, and update specifications and documentation as needed
- Knowledge base reviewed and updated
- Public documentation reviewed and updated
- Acceptance criteria met
  - update the code so that no characters are removed from DOIs
  - determine the list of DOIs that were affected by the removal of characters in the past
  - delete from the ES index the modified DOIs that resulted from character removal
  - generate a mapping of deleted DOIs to their replacement DOIs
- Acceptance testing passed
- Deployed to production
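One possible approach to the last three acceptance criteria, as a rough sketch only: re-apply the lossy transformation to every registered DOI and record the ones that change. The names `lossy` and `plan_cleanup` are hypothetical, and `lossy` assumes the removal amounts to "keep printable ASCII only"; the real rule would have to be taken from rest_api@923086f5.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def lossy(doi: str) -> str:
    """Assumed approximation of the past character removal."""
    return "".join(ch for ch in doi if ch.isascii() and ch.isprintable())


def plan_cleanup(registered_dois: Iterable[str]) -> Tuple[Dict[str, List[str]], List[str]]:
    """Return (mapping, to_delete).

    mapping:   modified DOI -> registered DOI(s) it was derived from
    to_delete: modified DOIs that are not themselves registered
    """
    registered = set(registered_dois)
    mapping: Dict[str, List[str]] = defaultdict(list)
    for doi in sorted(registered):
        modified = lossy(doi)
        if modified != doi:
            mapping[modified].append(doi)
    # Never delete a modified DOI that is also a legitimately registered DOI.
    to_delete = sorted(m for m in mapping if m not in registered)
    return dict(mapping), to_delete


mapping, to_delete = plan_cleanup(["10.2741/Ort\u00e9ga", "10.1000/abc"])
print(mapping)    # {'10.2741/Ortga': ['10.2741/Ortéga']}
print(to_delete)  # ['10.2741/Ortga']
```

The mapping keeps a list per modified DOI because two different registered DOIs could collapse to the same modified form, and the deletion list skips any modified DOI that happens to be registered in its own right.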