Stabilize infrastructure for Event Data Query API and Bus
Background
Review the health of the AWS-based infrastructure. Investigate and fix any issues that arise, without making substantial changes. This is a short-term measure until we can move into modern AWS.
Steps to take:
- Rewrite the Agent to use the Crossref API and retrieve work metadata as it goes.
- The Agent should include all relations, public references and all references to DataCite. It should perform a DOI-RA lookup to do this.
- Ensure that the new Agent uses the same mechanism for generating Event IDs as the existing one, so the behaviour is the same.
- Run the new Agent as a one-off backfill.
- Run the new Agent on a continuous basis.
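As a rough illustration of the DOI-RA lookup and Event ID steps above, here is a minimal Python sketch. It is not the Agent's actual code. The UUIDv5 scheme, the `EVENT_NAMESPACE` value and the function names are all hypothetical; the real Agent must reuse whatever ID mechanism is already in place. The lookup assumes the doi.org `doiRA` endpoint and a response of the form `[{"DOI": ..., "RA": ...}]`.

```python
import json
import uuid
from urllib.parse import quote

# Hypothetical namespace, for illustration only. The real Agent must keep
# using its existing Event ID mechanism so that IDs do not change.
EVENT_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "https://example.org/event-data/agent")


def event_id(subj_id: str, relation_type: str, obj_id: str) -> str:
    """Derive a deterministic ID from the triple that defines an Event.

    The same input always gives the same ID, so a backfill run and a
    continuous run produce identical IDs for the same relation.
    """
    key = "\n".join([subj_id, relation_type, obj_id])
    return str(uuid.uuid5(EVENT_NAMESPACE, key))


def doi_ra_url(doi: str) -> str:
    """Build the registration agency lookup URL for a DOI (assumed endpoint)."""
    return "https://doi.org/doiRA/" + quote(doi, safe="/")


def parse_doi_ra(body: str):
    """Return the registration agency name from a lookup response, or None.

    Assumes the body is a JSON list with one record per DOI, where a
    successful record carries an "RA" field.
    """
    records = json.loads(body)
    if not records:
        return None
    return records[0].get("RA")
```

With this shape, the Agent could fetch `doi_ra_url(doi)` for each referenced DOI, keep the reference if `parse_doi_ra` returns the agency it is looking for (for example DataCite), and emit an Event whose ID comes from `event_id`.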
Observed behavior
Monitoring of the Event Bus and Query API infrastructure is unreliable, so it is difficult to tell whether problems are caused by the infrastructure or by something else.
Expected behavior
Error rate reduced to acceptable level (close to zero). No alarms tripped for Event Bus or Query API.
Definition of ready
- Product owner: @mrittman
- Tech lead: @afandian
- Service:: label applied
- Definition of done updated
- Weight applied
Definition of done
- Unit tests identified, implemented, and passing
- Code reviewed
- Available via a staging URL
- Knowledge base reviewed and updated
- Public documentation reviewed and updated
- Consider any impacts to current or future architecture/infrastructure, and update specifications and documentation as needed
- Acceptance criteria met
- Pingdom alerting for Event Data Query API is verified as accurate.
- Pingdom alerting is stable.
- Elasticsearch index is up to date and stable.
Notes
Edited by Martyn Rittman