# Gitter outage 2020-03-06 - Redis Sentinel
Gitter was down.
I followed https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/9324#remediation to bring it back online. More info to follow.

## Outage

### Timeline

All times are UTC.

- 2020-03-06 07:30 - Sentinel failover
- 2020-03-06 10:00 - production services restarted
- 2020-03-06 13:00 - staging services restarted
- 2020-03-06 15:06 - another Sentinel failover (for some reason, this one didn't cause a full outage)
For some reason, the issue spiked again at 18:00, but there is no corresponding Sentinel failover.

## Actions

### Reduce time to respond

#### Last resort - human intervention

Actions that will make sure that the operators (Eric, me, and the infrastructure team) know immediately that something is wrong:
- Make @viktomas receive PagerDuty notifications
- Make this type of incident trigger a critical incident (https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/9416#note_300454215)

#### Automated response

- Reconnect to Redis after Redis Sentinel chooses a different master (https://gitlab.com/gitlab-org/gitter/webapp/issues/2407); see the first sketch after this list
- Look into why the `webapp` health check doesn't restart the services automatically. Monit will restart a service if it fails its health check (https://gitlab.com/gitlab-com/gl-infra/gitter-infrastructure/-/blob/master/ansible/roles/gitter/web/templates/monit-service.j2#L86), but the health check probably still passes when the app fails to connect to Redis; see the second sketch after this list.
Note: restarting `webapp` this way might be a bad idea as well; a Redis failure would then cascade and kill the `webapp` servers.
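
A minimal sketch of the reconnect behaviour, assuming the webapp can use ioredis; the Sentinel hostnames, ports, and the master group name `gitter-master` below are placeholders, not our real topology:

```typescript
// A client that connects through Sentinel instead of a fixed master address.
// ioredis asks the Sentinels for the current master on every (re)connect,
// so after a failover it follows the newly elected master automatically.
import Redis from "ioredis";

const redis = new Redis({
  // Placeholder Sentinel addresses.
  sentinels: [
    { host: "sentinel-01", port: 26379 },
    { host: "sentinel-02", port: 26379 },
    { host: "sentinel-03", port: 26379 },
  ],
  // Placeholder master group name as registered with Sentinel.
  name: "gitter-master",
  // Keep retrying forever with a capped backoff; each attempt re-queries
  // Sentinel, which is what picks up the new master.
  retryStrategy: (attempt) => Math.min(attempt * 500, 5000),
});

redis.on("error", (err) => console.error("redis error:", err.message));
redis.on("ready", () => console.log("connected to the current master"));
```

The key point is that the client never hardcodes the master's address; Sentinel stays the single source of truth for which node is the master.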
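
And a hypothetical deeper health check (the route, port, and 2-second timeout are assumptions, not the current `webapp` endpoint): it fails whenever Redis stops answering, so Monit's HTTP check would restart the service. As the note above points out, this turns a Redis outage into a `webapp` restart storm, so it probably only makes sense in combination with the client-side reconnect above.

```typescript
// A health-check endpoint that also verifies Redis connectivity, so Monit's
// HTTP check fails (and restarts the service) when Redis is unreachable.
import express from "express";
import Redis from "ioredis";

const app = express();
const redis = new Redis({
  sentinels: [{ host: "sentinel-01", port: 26379 }], // placeholder
  name: "gitter-master", // placeholder
});

// Hypothetical route; the real webapp health-check path may differ.
app.get("/api/health-check", async (_req, res) => {
  try {
    // Require a PING reply within 2 seconds, otherwise report unhealthy.
    await Promise.race([
      redis.ping(),
      new Promise((_resolve, reject) =>
        setTimeout(() => reject(new Error("redis ping timed out")), 2000)
      ),
    ]);
    res.status(200).send("OK");
  } catch (err) {
    console.error("health check failed:", err);
    res.status(503).send("redis unavailable");
  }
});

app.listen(5000); // placeholder port
```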

### Prevent the incident from happening

These are more proactive tasks that should make the incident much less likely, or impossible altogether.

Easing the load on Redis:
- Remove the eyeballs functionality: https://gitlab.com/gitlab-org/gitter/webapp/-/issues/2448

### Discovering root cause

- ??