[SE-3953] Updates NewRelic policy to check for failure instead
Created by: nizarmah
The alert condition now checks for failures instead of successes.
It also requires 2 consecutive failures before creating an alert/violation.
Once 2 failures happen within 11 minutes, a violation is opened. To guarantee the violation is closed, we use 'loss of signal' (also called 'signal expiration'). If there is no signal for 11 mins, i.e. no failed request for 11 mins, all issues are assumed to be resolved, so the violation is closed.
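To make the open/close behaviour described above concrete, here is a minimal Python sketch of the logic. It is a toy model, not NewRelic's implementation or its API: the `ViolationTracker` class, its method names, and the exact boundary handling are hypothetical, and it assumes a threshold of 2 failures and a window of 11 minutes. Because only failed checks produce a signal in this model, "consecutive" is approximated as "both failures fall inside the window".

```python
from dataclasses import dataclass, field
from typing import List

WINDOW_MINUTES = 11    # assumed evaluation window and signal-expiration duration
FAILURE_THRESHOLD = 2  # assumed number of failures needed to open a violation


@dataclass
class ViolationTracker:
    """Hypothetical toy model of the policy; timestamps are in minutes."""
    failures: List[float] = field(default_factory=list)
    is_open: bool = False

    def record_failure(self, now: float) -> None:
        # Drop failures that fell out of the window, then record the new one.
        self.failures = [t for t in self.failures if now - t < WINDOW_MINUTES]
        self.failures.append(now)
        if len(self.failures) >= FAILURE_THRESHOLD:
            self.is_open = True

    def tick(self, now: float) -> None:
        # Loss of signal: no failure for WINDOW_MINUTES closes the violation.
        if self.is_open and now - self.failures[-1] >= WINDOW_MINUTES:
            self.is_open = False
            self.failures.clear()


if __name__ == "__main__":
    tracker = ViolationTracker()
    tracker.record_failure(0)   # first failure: nothing opens yet
    tracker.record_failure(5)   # second failure inside the window: opens
    print("open after two failures:", tracker.is_open)
    tracker.tick(16)            # 11 minutes with no failure: closes
    print("open after signal loss:", tracker.is_open)
```

In this sketch the violation closes only because the failure signal stops arriving, which mirrors why the testing steps wait a full window after the fix is deployed.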
JIRA tickets: SE-3953
Screenshots:
Sandbox Instance: Sandbox Instance
Testing instructions:
- Open NewRelic Synthetic Summary for the sandbox instance's extended heartbeat check in a new tab.
- Open NewRelic Policy for the sandbox instance's extended heartbeat check in a new tab.
- SSH into the sandbox instance.
ssh 149.202.172.84
- Set the LMS env settings yaml file to the error file.
sudo cp /edx/etc/lms-error.yml /edx/etc/lms.yml
- Restart the LMS
/edx/bin/supervisorctl restart lms
- Wait 11 mins, and an incident should be opened for the sandbox instance.
- SSH into the sandbox instance, again.
ssh 149.202.172.84
- Set the LMS env settings yaml file to the fix file.
sudo cp /edx/etc/lms-fix.yml /edx/etc/lms.yml
- Restart the LMS, again.
/edx/bin/supervisorctl restart lms
- Wait 11 mins, and the incident should be closed for the sandbox instance.
Author notes and concerns:
- A NewRelic solution architect explains how to automatically close failure-checking alerts in this forum thread.
Reviewers