Gitter Email notifications broken since 2020-05-18
This is a screenshot from the [Mandrill dashboard](https://mandrillapp.com/?date_format=mm%2Fdd%2Fyy&q=&tag=_all&date_range=30&start_date=05%2F25%2F2020&stop_date=06%2F01%2F2020&__csrf_token=cf7db659023e40a78dc2e81a68e10b6f7c85018b):


It seems that our email notifications stopped going out on 2020-05-18 at around 11:00 UTC.
## Cause of the outage
None of the `webapp` servers had the `Group_primary-email-notification-server` or `Group_secondary-email-notification-server` tag, so the Ansible [task to add a cron job for notifications](https://gitlab.com/gitlab-com/gl-infra/gitter-infrastructure/-/blob/e3c45fd2332e51f78ec3f78440359a5471a92273/ansible/roles/gitter/web/tasks/cronjobs.yml#L9-17) didn't run on any of them.
```yaml
- set_fact:
    notification_email_schedule: "0,10,20,30,40,50"
  when: "'primary-email-notification-server' in group_names"
- set_fact:
    notification_email_schedule: "5,15,25,35,45,55"
  when: "'secondary-email-notification-server' in group_names"
```
## Why wasn't there any tagged notification server?
This code is in `user-data.sh` for every `webapp` instance:
```sh
# Ensure there is at least one instance that is sending out emails
# Looks for any existing instances with the `Group_primary-email-notification-server` tag and if not, adds to the current new server
# This probably a little problematic. If two servers startup at the same time, then potentially both could get the tag
does_primary_notification_email_server_exist=$(aws --region "$region" ec2 describe-instances --filters "Name=tag:Group_primary-email-notification-server,Values=" "Name=tag:aws:autoscaling:groupName,Values=$as_name" | jq '.Reservations[]')
does_secondary_notification_email_server_exist=$(aws --region "$region" ec2 describe-instances --filters "Name=tag:Group_secondary-email-notification-server,Values=" "Name=tag:aws:autoscaling:groupName,Values=$as_name" | jq '.Reservations[]')
if [ -z "$does_primary_notification_email_server_exist" ]; then
  aws --region "$region" ec2 create-tags --resources "$instance_id" --tags Key=Group_primary-email-notification-server,Value=
elif [ -z "$does_secondary_notification_email_server_exist" ]; then
  aws --region "$region" ec2 create-tags --resources "$instance_id" --tags Key=Group_secondary-email-notification-server,Value=
fi
```
The comment already says that it could be a little problematic, but it gets much worse. The query doesn't filter on instance state, and `describe-instances` keeps returning terminated instances for a while after termination. So this code will falsely report an existing EC2 instance with the `email-notification-server` tag if a tagged instance was terminated only recently.
You can see this in the following log from `webapp-01`, which found the terminated `webapp-03` as the `primary-email-notification-server`. Unfortunately, it also found `webapp-04` as the secondary, so `webapp-01` didn't initialize itself as an email server.
```sh
aws --region us-east-1 ec2 describe-instances --filters Name=tag:Group_primary-email-notification-server,Values= Name=tag:aws:autoscaling:groupName,Values=webapp-servers
+ does_primary_notification_email_server_exist='{
  "Groups": [],
  "Instances": [
    {
      "StateReason": {
        "Code": "Client.UserInitiatedShutdown",
        "Message": "Client.UserInitiatedShutdown: User initiated shutdown"
      },
      "Tags": [
        {
          "Value": "prod",
          "Key": "Env"
        },
        {
          "Value": "",
          "Key": "Group_webapp-servers"
        },
        {
          "Value": "webapp-servers",
          "Key": "aws:autoscaling:groupName"
        },
        {
          "Value": "webapp-03",
          "Key": "Name"
        },
        {
          "Value": "",
          "Key": "Group_primary-email-notification-server"
        },
      ],
      "State": {
        "Code": 48,
        "Name": "terminated"
      },
      "StateTransitionReason": "User initiated (2020-05-18 11:36:09 GMT)",
    }
  ]
}'
```
I remember terminating the last 4 instances at once because it was early morning in Europe and I knew that the remaining 4 instances would easily handle the load for the 5 minutes until the new instances booted up.
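
One possible way to avoid matching terminated instances is to add an `instance-state-name` filter to the lookup. The following is only a minimal sketch, not the fix that was actually deployed: `notification_server_exists` is a hypothetical helper name, and it assumes `$region` and `$as_name` are set the same way as in `user-data.sh`. It also replaces the `jq` step with the AWS CLI's own `--query` and `--output text` options.

```sh
# Hypothetical helper (not part of the original user-data.sh).
# Succeeds only if a pending or running instance in the autoscaling group
# carries the given tag; terminated instances are excluded by the state filter.
# Assumes $region and $as_name are already set, as in user-data.sh.
notification_server_exists() {
  tag_key="$1"
  instance_ids=$(aws --region "$region" ec2 describe-instances \
    --filters "Name=tag:${tag_key},Values=" \
              "Name=tag:aws:autoscaling:groupName,Values=$as_name" \
              "Name=instance-state-name,Values=pending,running" \
    --query 'Reservations[].Instances[].InstanceId' \
    --output text)
  # An empty result means no live instance holds the tag
  [ -n "$instance_ids" ]
}
```

With a helper like this, the existing checks could become `if ! notification_server_exists "Group_primary-email-notification-server"; then ...`. This sketch does not address the race mentioned in the script comment, where two servers starting at the same time could both claim the tag.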
## Original issue
https://gitlab.com/gitlab-org/gitter/webapp/-/issues/2532
Discovered by @jeremyVignelles
<details>
## I don't get email notifications anymore
I'm really disappointed with the gitter's notification system.
I have the android app, but I never managed to make it send me any notification. That's beyond the point here, and I managed to survive for two years with only e-mail notifications, sent 1 hour after the message.
Now, a few weeks ago, gitter suddenly stopped sending me e-mail notifications, but the "all notifications" checkbox on the room's settings is still checked.
More info:
- I'm logging in with my GitHub account
- I'm using gitter on 3 different browser, but never at the same time.
- I have the app installed, but I'm never using it
- Several messages, on different channels were sent during the weekend, and I didn't get any notification.
</details>