
[SE-1968] Send an urgent e-mail after all deployment attempts fail

This PR doesn't change the e-mails that we currently send to ops@ after each deployment failure. This PR adds a new type of e-mail which would go to the address we have in PagerDuty. The e-mail is triggered when all deployment attempts have been used. Only for betatest instances.

Fine details

  • the urgent e-mail actually goes to instance.provisioning_failure_notification_emails, whatever that is in each instance. It's mainly our PagerDuty e-mail. We'll assume that that value was already set in all important instances in production (this comes from a previous development)
  • the urgent e-mail is only sent if the instance has a beta test application, i.e. if they got the instance after filling in a form. This excludes test instances that we have created manually, and also big client instances (they were also created manually). It should be fine not to page when big client instances have some deployment failure, since they have older servers running
  • we don't consider special cases:
    • mass redeployments. We will still page if a mass-redeployment finds a beta test instance which fails N of N attempts. This would wake people up even when the failure isn't urgent. Implementing a solution for this is hard
    • sending e-mail to the owner of a sandbox, e.g. to the person who created the sandbox
    • detecting the organization and acting differently
    • any other e-mail. We don't use instance.email (this is a public e-mail for the LMS), nor instance.additional_monitoring_emails (these are very similar to instance.provisioning_failure_notification_emails, but the former are used to detect LMS unavailability and the latter to report provisioning failures. We could merge them)
    • it doesn't send the failure reason / backtrace by e-mail, because the failures could be many (one for each attempt) and this is complex to do and test. The e-mail is just a warning that triggers PagerDuty and wakes people up; the logs are in Ocim
    • we didn't consider/test the new interface, though the implemented behaviour will work the same and be equally useful there (i.e. after the last attempt of a user-requested deployment, it will page)
    • we didn't test much how this affects PR sandboxes and periodic builds. With periodic builds, a slight difference is that the person who watches the list (ops+master…@…) will get a single e-mail after all N of N deployments fail, instead of 1 per failed attempt
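
To make the rules above concrete, here is a minimal sketch of the decision logic. It is not the code in this PR: the Instance dataclass, has_beta_test_application, spawn_once and send_urgent are hypothetical stand-ins used only for illustration; only provisioning_failure_notification_emails and num_attempts are names taken from the description.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Instance:
    """Hypothetical stand-in for the real instance model."""
    name: str
    # Assumption: True when the instance was requested through the beta test form.
    has_beta_test_application: bool = False
    provisioning_failure_notification_emails: List[str] = field(default_factory=list)


def urgent_recipients(instance: Instance) -> List[str]:
    """Return the addresses to page, or an empty list if nobody should be paged."""
    if not instance.has_beta_test_application:
        # Manually created instances (test or big client) never page.
        return []
    return list(instance.provisioning_failure_notification_emails)


def spawn_with_retries(
    instance: Instance,
    spawn_once: Callable[[Instance], Optional[int]],
    send_urgent: Callable[[Instance, List[str]], None],
    num_attempts: int = 1,
) -> Optional[int]:
    """Try to spawn up to num_attempts times; page once only if every attempt fails."""
    for _attempt in range(num_attempts):
        appserver_id = spawn_once(instance)
        if appserver_id is not None:
            return appserver_id  # success: no urgent e-mail
    # All attempts have been used: send at most one urgent e-mail.
    recipients = urgent_recipients(instance)
    if recipients:
        send_urgent(instance, recipients)
    return None
```

The point of the sketch is that the urgent e-mail sits outside the retry loop, so a deployment with N failed attempts pages once rather than N times, while the existing per-attempt ops@ e-mails (not shown) are left untouched.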

This PR replaces https://github.com/open-craft/opencraft/pull/510/, which changed the existing e-mail types, whereas this one adds on top.

Dependencies: None

Testing instructions:

This is hard…

I suggest doing the following in your devstack:

Prepare a way to fail faster

In instance/models/openedx_appserver.py

         # Check firewall rules:
         try:
+            # FIXME remove.   Uncomment this line to accelerate testing failures
+            raise NotImplementedError("You won't deploy today")
             self.check_security_groups()
         except:  # pylint: disable=bare-except
             message = "Unable to check/update the network security groups for the new VM"

At instance/models/openedx_instance.py:

             Returns the ID of the new AppServer on success or None on failure.
         """
+        # FIXME uncomment this fast-return line to make the failures faster. Disable while testing CI
+        return self._create_owned_appserver()
+
         if not self.load_balancing_server:
             self.load_balancing_server = LoadBalancingServer.objects.select_random()
             self.save()

Make the blue button retry 3 times instead of just 1

Important because when you attempt 3 times, you'll be able to see whether you get 3 urgent e-mails (1 per attempt, bad), or just 1 (good).

At instance/tasks.py:

         instance_ref_id,
         mark_active_on_success=False,
         deactivate_old_appservers=False,
-        num_attempts=1,
+        # FIXME just testing. Restore back to 1. Set to 3 to test
+        num_attempts=3,
         success_tag=None,
         failure_tag=None):
     """

Rest of testing instructions

  1. Create an instance in your devstack. Do it from the form, or adapt it so that it has a BetaTestApplication which is approved
  2. in the admin, add a value for Provisioning failure notification emails: you can type it as a normal string, e.g. sometext@someemail.com
  3. in your .env, check your ADMINS. It should say something like ADMINS='"Devopos", "ops@somedomain.com"'
  4. click the blue button. You should see 3 e-mails with the backtrace (1 per attempt) to the address in ADMINS, and then a short e-mail at the end to the other address
  5. retry a few times
  6. try it in an instance which doesn't have a BetaTestApplication. It will send only the 3 ops e-mails, not the urgent one
  7. Remove the provisioning_failure_notification_emails from the first instance; now it should not send the urgent e-mail to it
  8. Run the CI tests
  9. Test in stage! With real deployments. You can make the deployments fail in some way, like shutting down the instance while it's provisioning
  10. Ideally, test it with different types of error: VM/infrastructure provisioning error, playbook error
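
The e-mail counts expected in steps 4, 6 and 7 can be summarised with a small sketch. It assumes num_attempts=3 as in the diff above; simulate_failed_deployment and count_mails are hypothetical helpers that only model the expected behaviour, they are not the real tasks or tests in this PR.

```python
from unittest import mock


def simulate_failed_deployment(has_beta_app, urgent_addresses, num_attempts, send_mail):
    """Hypothetical model of a deployment in which every attempt fails."""
    for attempt in range(1, num_attempts + 1):
        # Existing behaviour: one e-mail with the backtrace per failed attempt.
        send_mail("ops", f"Attempt {attempt} of {num_attempts} failed")
    if has_beta_app and urgent_addresses:
        # New behaviour: a single short e-mail once all attempts are used.
        send_mail("urgent", f"All {num_attempts} attempts failed")


def count_mails(has_beta_app, urgent_addresses, num_attempts=3):
    """Return (number of ops e-mails, number of urgent e-mails)."""
    send_mail = mock.Mock()
    simulate_failed_deployment(has_beta_app, urgent_addresses, num_attempts, send_mail)
    kinds = [call.args[0] for call in send_mail.call_args_list]
    return kinds.count("ops"), kinds.count("urgent")


# Step 4: beta test instance with an address configured.
print(count_mails(True, ["sometext@someemail.com"]))
# Step 6: instance without a BetaTestApplication.
print(count_mails(False, ["sometext@someemail.com"]))
# Step 7: beta test instance with the address removed.
print(count_mails(True, []))
```

If manual testing shows more than one urgent e-mail in the first case, the urgent e-mail is being triggered per attempt rather than after the last one.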

Review and notes

  • Check the e-mail message text too.
  • I wasn't sure about where to place the new tests. First I put them in test_openedx_appserver.py, near test_provision_failed_email, but it's not good because those are tests for 1 attempt (there's no num_attempts concept). test_openedx_instance.py however has multi-attempts. And the test is near test_spawn_appserver_failed
  • I considered using ddt to combine both tests I wrote but I'd have to add more logic and conditionals inside the test, so I preferred to copy a few lines.
  • I will squash the commits once done.
  • There's a small fix about DNS (gandi.py); I saw some warning or error about an undeclared variable
  • I saw a hilarious pylint error where it complains about a long attribute name (which was already there…) and I can't suppress it: pylint reports useless-suppression / Useless suppression of 'invalid-name'. We're using pylint 2.2.3 and it has some bugs (I already saw this some months ago at https://github.com/open-craft/opencraft/pull/436, but the newer pylint versions were also broken). Fixed by adding noqa to the pylint comment.

Deployment

Should be easy, it doesn't require migrations. Not dangerous.
