Gitter Email notifications broken since 2020-05-18
These are screenshots from the [Mandrill dashboard](https://mandrillapp.com/?date_format=mm%2Fdd%2Fyy&q=&tag=_all&date_range=30&start_date=05%2F25%2F2020&stop_date=06%2F01%2F2020&__csrf_token=cf7db659023e40a78dc2e81a68e10b6f7c85018b):

![Screenshot_2020-06-01_at_12.01.18_PM](/uploads/869d3ae7719a21f6b96c82ace49bedcb/Screenshot_2020-06-01_at_12.01.18_PM.png)

![Screenshot_2020-06-01_at_12.07.58_PM](/uploads/e1989c20d1cfa047780e3aa7fd3ad8a1/Screenshot_2020-06-01_at_12.07.58_PM.png)

It seems that our notifications stopped going out on 2020-05-18 at 11:00 UTC.

## Cause of the outage

None of the `webapp` servers had the `Group_primary-email-notification-server` or `Group_secondary-email-notification-server` tag, so the Ansible [task to add a cron job for notifications](https://gitlab.com/gitlab-com/gl-infra/gitter-infrastructure/-/blob/e3c45fd2332e51f78ec3f78440359a5471a92273/ansible/roles/gitter/web/tasks/cronjobs.yml#L9-17) didn't run.

```yaml
- set_fact:
    notification_email_schedule: "0,10,20,30,40,50"
  when: "'primary-email-notification-server' in group_names"

- set_fact:
    notification_email_schedule: "5,15,25,35,45,55"
  when: "'secondary-email-notification-server' in group_names"
```

## Why wasn't there any tagged notification server?

This code is in `user-data.sh` for every `webapp` instance:

```sh
# Ensure there is at least one instance that is sending out emails
# Looks for any existing instances with the `Group_primary-email-notification-server` tag and if not, adds to the current new server
# This probably a little problematic. If two servers startup at the same time, then potentially both could get the tag
does_primary_notification_email_server_exist=$(aws --region "$region" ec2 describe-instances --filters "Name=tag:Group_primary-email-notification-server,Values=" "Name=tag:aws:autoscaling:groupName,Values=$as_name" | jq '.Reservations[]')
does_secondary_notification_email_server_exist=$(aws --region "$region" ec2 describe-instances --filters "Name=tag:Group_secondary-email-notification-server,Values=" "Name=tag:aws:autoscaling:groupName,Values=$as_name" | jq '.Reservations[]')

if [ -z "$does_primary_notification_email_server_exist" ]; then
  aws --region "$region" ec2 create-tags --resources "$instance_id" --tags Key=Group_primary-email-notification-server,Value=
elif [ -z "$does_secondary_notification_email_server_exist" ]; then
  aws --region "$region" ec2 create-tags --resources "$instance_id" --tags Key=Group_secondary-email-notification-server,Value=
fi
```

The comment already admits that this could be a little problematic, but it gets much worse. The lookup does not filter on instance state, so it reports a terminated EC2 instance as an existing `email-notification-server` if that instance was terminated only recently and is still returned by `describe-instances`.

The following log from `webapp-01` shows that it found the terminated `webapp-03` as the `primary-email-notification-server`. Unfortunately, it also found `webapp-04` as the secondary, so `webapp-01` didn't initialize itself as an email server.
```sh
aws --region us-east-1 ec2 describe-instances --filters Name=tag:Group_primary-email-notification-server,Values= Name=tag:aws:autoscaling:groupName,Values=webapp-servers
+ does_primary_notification_email_server_exist='{
  "Groups": [],
  "Instances": [
    {
      "StateReason": {
        "Code": "Client.UserInitiatedShutdown",
        "Message": "Client.UserInitiatedShutdown: User initiated shutdown"
      },
      "Tags": [
        {
          "Value": "prod",
          "Key": "Env"
        },
        {
          "Value": "",
          "Key": "Group_webapp-servers"
        },
        {
          "Value": "webapp-servers",
          "Key": "aws:autoscaling:groupName"
        },
        {
          "Value": "webapp-03",
          "Key": "Name"
        },
        {
          "Value": "",
          "Key": "Group_primary-email-notification-server"
        },
      ],
      "State": {
        "Code": 48,
        "Name": "terminated"
      },
      "StateTransitionReason": "User initiated (2020-05-18 11:36:09 GMT)",
    }
  ]
}'
```

I remember terminating the last 4 instances at once because it was early morning in Europe and I knew that the remaining 4 instances would easily handle the load for the 5 minutes it takes the new instances to boot up.

## Original issue

https://gitlab.com/gitlab-org/gitter/webapp/-/issues/2532

Discovered by @jeremyVignelles

<details>

## I don't get email notifications anymore

I'm really disappointed with the gitter's notification system. I have the android app, but I never managed to make it send me any notification. That's beyond the point here, and I managed to survive for two years with only e-mail notifications, sent 1 hour after the message.

Now, a few weeks ago, gitter suddenly stopped sending me e-mail notifications, but the "all notifications" checkbox on the room's settings is still checked.

More info:

- I'm logging in with my GitHub account
- I'm using gitter on 3 different browser, but never at the same time.
- I have the app installed, but I'm never using it
- Several messages, on different channels were sent during the weekend, and I didn't get any notification.

</details>
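Coming back to the `user-data.sh` lookup described under the cause analysis: one way to stop terminated instances from counting as notification servers might be to add an `instance-state-name` filter to the `describe-instances` call. The following is only a sketch, not the fix that was actually deployed. The function names `find_live_tagged_instances`, `choose_notification_tag` and `claim_notification_tag` are hypothetical, and treating both `pending` and `running` as "alive" is an assumption. `$region`, `$as_name` and `$instance_id` are expected to be set as in the original script.

```sh
#!/bin/sh
# Sketch only: ignore terminated instances when looking for a notification server.

# Hypothetical helper: prints the matching reservations, or nothing when no
# live instance in the autoscaling group carries the given tag.
# Assumption: "pending" and "running" are the states that count as alive.
find_live_tagged_instances() {
  aws --region "$region" ec2 describe-instances \
    --filters "Name=tag:$1,Values=" \
              "Name=tag:aws:autoscaling:groupName,Values=$as_name" \
              "Name=instance-state-name,Values=pending,running" \
    | jq '.Reservations[]'
}

# Hypothetical helper: decides which tag, if any, this instance should claim.
# $1 and $2 are the lookup results for the primary and secondary tag;
# an empty string means no live instance currently holds that tag.
choose_notification_tag() {
  if [ -z "$1" ]; then
    echo "Group_primary-email-notification-server"
  elif [ -z "$2" ]; then
    echo "Group_secondary-email-notification-server"
  fi
}

# Hypothetical entry point mirroring the original if/elif block.
claim_notification_tag() {
  primary=$(find_live_tagged_instances "Group_primary-email-notification-server")
  secondary=$(find_live_tagged_instances "Group_secondary-email-notification-server")
  tag=$(choose_notification_tag "$primary" "$secondary")
  if [ -n "$tag" ]; then
    aws --region "$region" ec2 create-tags --resources "$instance_id" --tags "Key=$tag,Value="
  fi
}
```

In `user-data.sh` this would be invoked as `claim_notification_tag` once `region`, `as_name` and `instance_id` are known. The sketch only addresses stale terminated instances; the race mentioned in the original comment (two servers booting at the same time and both claiming the tag) is still there.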