Monitoring: adding some foundational checks for deposit processing in production (doi.crossref.org)
Right now we have some very basic alerting in place through Slack: the #queue-view channel is pinged when the queue exceeds 35,000 submissions to doi.crossref.org. This alarm is helpful, but support still relies heavily on member reports and manual observation to spot problems in submission processing (as exemplified by this incident from October 2021: https://status.crossref.org/incidents/z4jx9js265pp).
After talking with Mike, Jon, and Sara, we'd all like to expand those checks to give us more visibility into the health of submission processing in production.
We'd like to monitor (with checks every 5 to 10 minutes) for the following:
- If threads die or lock up
- If a submission takes more than 6 hours to process (according to Mike, this should just not happen)
- Small file alerts: a file of size X has taken Y amount of time to process (this might be a sign of a larger problem in processing; we'll need some research and perhaps tweaking to get the variables correct here)
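As a starting point for the small file alert, here is a minimal sketch in Python. It assumes each in-process submission can be read as a record with an ID, a size, and a processing start time; the `Submission` shape and the `slow_small_files` name are hypothetical and not taken from the existing system. The size and time thresholds (X and Y above) are deliberately left as parameters, since they still need research.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Submission:
    # Hypothetical record shape; the real queue data may look different.
    submission_id: str
    size_bytes: int
    started_at: datetime


def slow_small_files(submissions, now, max_size_bytes, max_duration):
    """Return small submissions that have been processing for too long.

    A submission is flagged when its size is at most max_size_bytes (X)
    and it has been processing for longer than max_duration (Y).
    """
    return [
        s
        for s in submissions
        if s.size_bytes <= max_size_bytes and now - s.started_at > max_duration
    ]
```

Keeping X and Y as arguments means the thresholds can be tuned against real processing data without touching the check itself.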
How urgent
We would like to have this in place by 1 January 2022.
Definition of ready
- Product owner: @SaraBowman
- Tech lead: @jonmstark
- Service:: or C:: label applied
- Definition of done updated
- Acceptance testing plan: check in production
- Weight applied
Definition of done
- Unit tests identified, implemented, and passing
- Code reviewed
- Available for acceptance testing via a staging URL, or otherwise
- Consider any impacts to current or future architecture/infrastructure, and update specifications and documentation as needed
- Knowledge base reviewed and updated
- Acceptance criteria met
  - Perform check on submissions in process every 10 minutes and send an alert to #queue-view Slack channel if a submission has been processing for >6 hours
  - Perform check on processing threads every 10 minutes and send an alert to #queue-view Slack channel if the number of active threads has dropped since the last check
  - Perform check on processing threads every 10 minutes and send an alert to #queue-view Slack channel if one or more threads' progress % has not increased since the last check
- Acceptance testing passed
- Deployed to production
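The three acceptance criteria could be implemented as a comparison between two snapshots of the processing threads taken 10 minutes apart. The sketch below is one possible shape, not the actual implementation: the `ThreadStatus` record, the `check_threads` function, and the snapshot format (a dict keyed by thread ID) are all assumptions. Scheduling the check and posting the resulting messages to #queue-view are left out.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# From the acceptance criteria: alert when processing exceeds 6 hours.
MAX_PROCESSING_TIME = timedelta(hours=6)


@dataclass
class ThreadStatus:
    # Hypothetical snapshot of one processing thread.
    thread_id: str
    submission_id: str
    started_at: datetime
    progress_pct: float


def check_threads(previous, current, now):
    """Compare two snapshots (dicts of thread_id -> ThreadStatus).

    Returns a list of alert messages, empty if everything looks healthy.
    """
    alerts = []

    # Criterion 1: a submission has been processing for more than 6 hours.
    for t in current.values():
        if now - t.started_at > MAX_PROCESSING_TIME:
            alerts.append(f"Submission {t.submission_id} has been processing for >6 hours")

    # Criterion 2: the number of active threads dropped since the last check.
    if len(current) < len(previous):
        alerts.append(f"Active threads dropped from {len(previous)} to {len(current)}")

    # Criterion 3: a thread's progress % has not increased since the last check.
    # Only compare when the thread is still on the same submission.
    for thread_id, t in current.items():
        before = previous.get(thread_id)
        if (
            before is not None
            and before.submission_id == t.submission_id
            and t.progress_pct <= before.progress_pct
        ):
            alerts.append(f"Thread {thread_id} has made no progress on submission {t.submission_id}")

    return alerts
```

One design question this raises: a drop in the active thread count can also happen when the queue simply drains, so criterion 2 may need to take queue depth into account to avoid false alarms.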
Notes
Edited by Patrick Polischuk