
[SE-3487] enable LOS for NRQL alert conditions

Boros Gábor requested to merge gabor/update-new-relic-los-policies into master

Based on NewRelic's announcement, we must change our NRQL alert conditions to enable loss of signal (LOS) detection. Also, as I understand our use case, it is worth filling the gaps with static "0" values to indicate that we did not receive data during that time period. This way we remain "backward compatible" in terms of the shape of the data.
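To illustrate what filling gaps with a static "0" means for the shape of the data, here is a minimal sketch. It is only a conceptual illustration, not how NewRelic implements gap filling; the function name `fill_gaps` and the sample values are made up for this example.

```python
def fill_gaps(points, start, end, step, fill_value=0):
    """Return one (timestamp, value) pair per aggregation window in [start, end).

    Windows for which no data arrived get `fill_value`, so the series keeps
    the same shape whether or not a signal was received.
    """
    return [(ts, points.get(ts, fill_value)) for ts in range(start, end, step)]


# Hypothetical data: one value per 60-second window, the window at 120 is missing.
received = {0: 12, 60: 9, 180: 11}
filled = fill_gaps(received, start=0, end=240, step=60)
print(filled)
```

With a static fill, the missing window shows up as an explicit 0 instead of being absent from the series.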

This change will require us to "re-install" all the conditions we have, which means that we first delete them and then add them again. An alternative would be to update all conditions with the "Migrator" application NewRelic provided to help with the migration. Both approaches can work, but I'd go with re-installing the conditions OCIM knows about.

For conditions that were added manually (such as the ones that check OCIM itself), we need to use the Migrator application.

Screenshots:

Screenshot 2020-10-14 at 13 50 18

Sandbox URL: N/A

Testing instructions:

  1. Go to NewRelic, open the policy created from the stage env, and check the "Thresholds" section. It must contain the new "Loss of Signal" settings ("Signal lost after")
  2. Go to OCIM production and get the NewRelic related settings
  3. Go to OCIM stage, check out this branch and add the copied settings to .env
  4. Restart the shell on stage
  5. Get a random (successfully provisioned) instance's ID
  6. Get the instance object from the OpenEdXInstance model
  7. Call the enable_monitoring() function on the instance
  8. Check that NewRelic created the monitoring policy and set the desired parameters
  9. Revert the changes on stage so it uses the master branch again, including deleting the copied NewRelic settings
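Steps 5 to 7 can be sketched as a small helper to paste into the shell on stage. This is only a sketch: the helper name `enable_monitoring_for` is made up, the import path in the comment is an assumption about where OCIM defines the model, and the instance ID is a placeholder.

```python
def enable_monitoring_for(instance_model, instance_id):
    """Fetch an instance by ID via the Django model manager and enable monitoring on it."""
    instance = instance_model.objects.get(id=instance_id)
    instance.enable_monitoring()
    return instance


# Usage in the shell on stage (import path is an assumption, adjust if it differs):
#   from instance.models.openedx_instance import OpenEdXInstance
#   enable_monitoring_for(OpenEdXInstance, <instance ID picked in step 5>)
```

Passing the model class in as an argument keeps the helper independent of the exact module layout.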

Author notes and concerns:

  1. The REST documentation is not updated yet, but the API already supports the new options. For the API's request schema, please check its playground. For NerdGraph, whose documentation already covers the new functionality, you can check this documentation.
  2. Based on the UI (I did not find documentation about this, though), the expiration_duration must be at least 60 seconds.
  3. I set expiration.closeViolationsOnExpiration to False to be more explicit, although its default value is already False.
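For reviewers, the loss of signal part of a condition could look roughly like the sketch below. It is not the code from this branch: the helper name `build_los_settings` is made up, the snake_case key names are assumptions modelled on the names mentioned above, and opening a violation on expiration is an assumed default. Please verify the exact request schema in the API playground.

```python
MIN_EXPIRATION_SECONDS = 60  # lower bound observed in the UI, not found in the docs


def build_los_settings(expiration_duration=MIN_EXPIRATION_SECONDS,
                       fill_value=0,
                       open_violation=True):
    """Build the loss of signal and gap filling part of an NRQL condition payload.

    Key names are assumptions; check them against the API playground.
    """
    if expiration_duration < MIN_EXPIRATION_SECONDS:
        raise ValueError(
            f"expiration_duration must be at least {MIN_EXPIRATION_SECONDS} seconds"
        )
    return {
        "expiration": {
            "expiration_duration": expiration_duration,
            # Assumption: we want a violation when the signal is lost.
            "open_violation_on_expiration": open_violation,
            # Explicitly False, even though that is already the default.
            "close_violations_on_expiration": False,
        },
        "signal": {
            # Fill gaps with a static value to keep the shape of the data.
            "fill_option": "static",
            "fill_value": fill_value,
        },
    }


settings = build_los_settings(expiration_duration=120)
```

Validating the 60 second minimum on our side means a misconfiguration fails early instead of surfacing as an API error.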
