Detect when Coverage batch process has completed and alert us when it has not
Background
Staff and members frequently ask why data is out of sync in the REST API, member route, Participation Reports, etc. This is mainly because coverage calculations do not complete. Manual intervention by the Tech team is then needed to update the affected data.
Observed behavior
Coverage data does not update for days, and we are not made aware of it until we notice it manually or a member complains.
Expected behavior
When the Coverage calculator batch process completes, it should push metrics to say so. If we do not see Coverage metrics for 48 hours, an alert should trigger.
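The 48-hour rule amounts to a single comparison between the last completion time and the current time. A minimal sketch, assuming both are available as Unix timestamps; the function name coverage_is_stale and its arguments are hypothetical and not part of any existing tooling:

```bash
#!/bin/bash
# Hypothetical helper: succeeds (exit status 0) when the last completed run
# is older than the threshold, i.e. when an alert should fire.
# Arguments: last completion time and current time as Unix timestamps,
# then the threshold in hours (defaults to 48, as in the expected behavior).
coverage_is_stale() {
  local last_completed=$1 now=$2 threshold_hours=${3:-48}
  (( now - last_completed > threshold_hours * 3600 ))
}

# Example use (commented out so that sourcing this file has no side effects):
# now=$(date +%s)
# coverage_is_stale "$(( now - 50 * 3600 ))" "$now" && echo "coverage is stale"
```

In practice the same comparison would more likely live in the alerting system than in a script; the sketch only pins down what "not seen for 48 hours" means.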
From #1150 (comment 536567077)
Prometheus has a mechanism, a "push gateway", that allows batch processes to easily push metrics into the system. We can then use these to build alerts with Prometheus, Grafana or other tools. We have a push gateway installed and running on prom1 in the datacenter. You can send metrics to it easily using curl. So, for example, if this batch process were called cr_coverage_calculator, the following bash fragment would allow you to send a timestamp to our Prometheus instance:
#!/bin/bash
# Print the current time as a Unix timestamp (seconds since the epoch)
timestamp() {
  date +"%s"
}
# Record when the batch process starts
echo "start_timestamp $(timestamp)" | curl --data-binary @- http://prom1:9091/metrics/job/cr_coverage_calculator
# Run the batch process here
# Record when the batch process ends
echo "end_timestamp $(timestamp)" | curl --data-binary @- http://prom1:9091/metrics/job/cr_coverage_calculator
And then we could create an alert on this that lets us know if the metric has not been seen in X hours.
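One gap in the fragment above is that end_timestamp is pushed even if the batch step fails. The sketch below pushes it only on success, so that a failed run leaves the last good timestamp in place. The names push_metric, run_and_report and PUSHGATEWAY_URL are hypothetical, and the URL simply reuses the example job name from the comment above:

```bash
#!/bin/bash
# Sketch only: the variable and function names are illustrative assumptions.
PUSHGATEWAY_URL="${PUSHGATEWAY_URL:-http://prom1:9091/metrics/job/cr_coverage_calculator}"

# Send one "name value" sample to the push gateway.
push_metric() {
  echo "$1 $2" | curl --silent --fail --data-binary @- "$PUSHGATEWAY_URL"
}

# Run the given command; push end_timestamp only if it exits successfully.
run_and_report() {
  push_metric start_timestamp "$(date +%s)"
  if "$@"; then
    push_metric end_timestamp "$(date +%s)"
  else
    local status=$?
    echo "batch command failed with status $status; end_timestamp not pushed" >&2
    return "$status"
  fi
}

# Example use (commented out; the command path is a placeholder):
# run_and_report /path/to/coverage_batch_command
```

A push gateway keeps exposing the last pushed value until it is deleted, so the alert probably needs to compare the value of end_timestamp with the current time rather than wait for the metric to disappear. An expression along the lines of time() - end_timestamp{job="cr_coverage_calculator"} > 48 * 3600 would do that, although the exact label depends on how the push gateway is scraped.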
How urgent
Coverage calculations are important to our members who are working to improve their metadata. This data drives Participation Reports, so it is vital that the numbers update on schedule and that this process does not fail silently.
Definition of ready
- Product owner:
- Tech lead:
- Service:: or C:: label applied
- Definition of done updated
- Acceptance testing plan:
- Weight applied
Definition of done
- Unit tests identified, implemented, and passing
- Code reviewed
- Available for acceptance testing via a staging URL, or otherwise
- Consider any impacts to current or future architecture/infrastructure, and update specifications and documentation as needed
- Knowledge base reviewed and updated
- Public documentation reviewed and updated
- Acceptance criteria met
  - AC 1
  - AC 2
- Acceptance testing passed
- Deployed to production