[SE-3953] Updates NewRelic policy to check for failure instead

Boros Gábor requested to merge nizar/se-3953-newrelic-new-policy into master

Created by: nizarmah

The policy now checks for failure instead of success.

The policy also requires 2 (previously 3) consecutive failures before it creates an alert/violation.

Once 2 alerts happen within 11 minutes (previously 3 within 16), a violation is opened. To guarantee that the alert is closed, we use 'loss of signal', also called 'signal expiration'. If there is no signal for 11 minutes, i.e. no failed request for 11 minutes, we take it to mean that all issues were resolved, so the violation is closed.
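The open/close behaviour described above can be sketched as a small simulation. This is only an illustration of the reasoning, not NewRelic's actual evaluation engine: the function name `simulate`, the one-minute tick, and the constants are assumptions made for the example, and it counts failures inside the window rather than modelling NewRelic's own notion of consecutive failures.

```python
from typing import List, Tuple

WINDOW_MIN = 11        # assumed: evaluation window and signal-expiration time, in minutes
FAILURE_THRESHOLD = 2  # assumed: failures needed inside the window to open a violation


def simulate(failure_minutes: List[int], end_minute: int) -> List[Tuple[int, str]]:
    """Return (minute, event) pairs for violations opened and closed.

    failure_minutes: minutes at which the check reported a failure
                     (successes produce no signal at all).
    end_minute: last minute of the simulation, inclusive.
    """
    failures = set(failure_minutes)
    recent: List[int] = []  # failure timestamps still inside the window
    last_failure = None
    violation_open = False
    events: List[Tuple[int, str]] = []

    for minute in range(end_minute + 1):
        if minute in failures:
            recent.append(minute)
            last_failure = minute
        # Drop failures that have fallen out of the window.
        recent = [m for m in recent if minute - m < WINDOW_MIN]

        if not violation_open and len(recent) >= FAILURE_THRESHOLD:
            violation_open = True
            events.append((minute, "opened"))
        elif (
            violation_open
            and last_failure is not None
            and minute - last_failure >= WINDOW_MIN
        ):
            # Loss of signal: no failure seen for a full window, so close.
            violation_open = False
            events.append((minute, "closed"))

    return events


if __name__ == "__main__":
    print(simulate([0, 5], 30))
```

With failures at minutes 0 and 5, the sketch opens a violation at minute 5 and closes it at minute 16, once 11 minutes have passed without a new failure. Two failures more than 11 minutes apart never open a violation.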

JIRA tickets: SE-3953

Screenshots:

  • Opening Alert: image

  • Closing Alert: image

Sandbox Instance: Sandbox Instance

Testing instructions:

  1. Open NewRelic Synthetic Summary for the sandbox instance's extended heartbeat check in a new tab.
  2. Open NewRelic Policy for the sandbox instance's extended heartbeat check in a new tab.
  3. SSH into the sandbox instance: ssh 149.202.172.84
  4. Set the LMS env settings yaml file to the error file: sudo cp /edx/etc/lms-error.yml /edx/etc/lms.yml
  5. Restart the LMS: /edx/bin/supervisorctl restart lms
  6. Wait 11 mins, and an incident should be opened for the sandbox instance.
  7. SSH into the sandbox instance again: ssh 149.202.172.84
  8. Set the LMS env settings yaml file to the fix file: sudo cp /edx/etc/lms-fix.yml /edx/etc/lms.yml
  9. Restart the LMS again: /edx/bin/supervisorctl restart lms
  10. Wait 11 mins, and the incident should be closed for the sandbox instance.
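The break and fix phases in steps 3 to 5 and 7 to 9 run the same two commands with a different settings file, so they can be sketched as one helper. This is a hypothetical convenience, not part of the merge request: the names `phase_commands` and `ssh_invocation` are made up for illustration, and the helper only builds the command line without executing anything.

```python
from typing import List

HOST = "149.202.172.84"  # sandbox address taken from the testing instructions


def phase_commands(variant: str) -> List[str]:
    """Remote commands for one phase; variant is 'error' (break) or 'fix' (repair)."""
    if variant not in ("error", "fix"):
        raise ValueError("variant must be 'error' or 'fix'")
    return [
        f"sudo cp /edx/etc/lms-{variant}.yml /edx/etc/lms.yml",
        "/edx/bin/supervisorctl restart lms",
    ]


def ssh_invocation(variant: str) -> List[str]:
    """Build the argv for a single ssh call that runs both commands in order."""
    remote = " && ".join(phase_commands(variant))
    return ["ssh", HOST, remote]


if __name__ == "__main__":
    # Dry run: print what would be executed for each phase.
    for variant in ("error", "fix"):
        print(" ".join(ssh_invocation(variant)))
```

To actually run a phase, the returned list could be passed to `subprocess.run(ssh_invocation("error"), check=True)`, followed by the 11-minute wait before checking the incident in NewRelic.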

Author notes and concerns:

  1. A NewRelic solution architect explains how to automatically close failure-checking alerts in this forum thread.
