Gitter outage 2020-2-26 - Redis Sentinel going awry
Gitter was down for about 2 hours on 2020-2-26.
This outage is similar to https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/8821 and https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/8664, but this time we have a better grasp on why it's happening and have a plan to properly fix it (see the follow-up section below).
Investigation
- Redis Sentinel went down and the leader-vote churn started around 8:34AM:
  - [8:34AM (GMT-6)] sentinel-03: +sdown master gitter-master 10.0.11.124 6379, https://gitter.pagerduty.com/incidents/PH869RY
  - [8:34AM (GMT-6)] sentinel-02: +vote-for-leader, https://gitter.pagerduty.com/incidents/P8VP56S
  - [8:34AM (GMT-6)] sentinel-02: +sdown master gitter-master 10.0.11.124 6379, https://gitter.pagerduty.com/incidents/PK5M41J
  - [8:34AM (GMT-6)] sentinel-02: +try-failover master gitter-master 10.0.11.124 6379, https://gitter.pagerduty.com/incidents/P6TGA09
  - [8:34AM (GMT-6)] etc, ...
- [8:52AM (GMT-6)] ws-xx WebSocket servers become unavailable for our SSL check
- [9:15AM (GMT-6)] More problems on the ws-xx WebSocket servers: they were not able to upload their CPU/memory profiling data to AWS, e.g. on ws-07:
  - Profiling CPU usage (/tmp/tmp.ShXDGJ5z70.deploy-tools-profile, kernel.kptr_restrict = 0, kernel.perf_event_paranoid = 0): profile-service-cpu failed
  - Profiling memory usage: S3 upload failed for /var/log/gitter/heap.ws-07.gitter-websockets-1.29692.2020-02-26-151516.heapsnapshot
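The Sentinel event sequence above can be pulled straight out of the Sentinel logs on the sentinel-0x boxes. A rough sketch (the sample log lines below are fabricated to mirror the events above; on a real box you would point this at the actual Redis Sentinel log file instead):

```shell
#!/bin/sh
# Fabricated Sentinel log lines mirroring the 8:34AM event sequence above.
# On a real sentinel-0x box this would be the actual Sentinel log file.
cat <<'EOF' > /tmp/sentinel.sample.log
14:34:02.110 # +sdown master gitter-master 10.0.11.124 6379
14:34:02.171 # +vote-for-leader 3c1a2b9d8e7f 117
14:34:02.233 # +try-failover master gitter-master 10.0.11.124 6379
14:34:12.410 # -failover-abort-not-elected master gitter-master 10.0.11.124 6379
14:34:13.001 * +reboot slave 10.0.11.7:6379 10.0.11.7 6379
EOF

# Keep only the events that tell the failover story.
grep -E '\+sdown|\+vote-for-leader|\+try-failover|failover-abort' /tmp/sentinel.sample.log
```

This prints the four failover-related events and drops the noise, which is roughly how the timeline above was assembled.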
Looking at the Redis entries in our Kibana logging (ssh -L 5601:localhost:5601 deployer@logging-01.prod.gitter -> http://localhost:5601/app/kibana):
- Lots of `Redis sentinel client: sentinel message: -failover-abort-not-elected` when the outage started
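For context (my reading of the Redis Sentinel docs, so treat it as an assumption about our exact setup): `-failover-abort-not-elected` means a Sentinel saw the master as down and tried to fail over, but lost the leader election. Reaching the monitor quorum is enough to *start* a failover attempt, but actually running one requires a majority of all known Sentinels to vote for the same leader, so a partially-down Sentinel fleet can keep trying and aborting. An illustrative sentinel.conf fragment (values are made up, not our real config):

```conf
# sentinel.conf (illustrative values only, not Gitter's actual config)

# quorum = 2: two Sentinels must agree the master is subjectively down
# (+sdown) before it is marked objectively down and a failover can start.
sentinel monitor gitter-master 10.0.11.124 6379 2

# How long the master must be unreachable before a Sentinel declares +sdown.
sentinel down-after-milliseconds gitter-master 5000

# Give up on a failover attempt that makes no progress within this window.
sentinel failover-timeout gitter-master 60000
```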
Metrics/dashboards
Nothing stands out to me on our dashboards: https://app.datadoghq.com/dashboard/lists. I see the drop in traffic and load when the servers stopped responding.
Screenshots
https://app.datadoghq.com/dashboard/jz3-a5e-mye/redis---overview?from_ts=1582560372393&live=true&tile_size=m&to_ts=1582733172393

https://app.datadoghq.com/dashboard/spf-vjw-ujs/websocket-servers?from_ts=1582560383780&to_ts=1582733183780&live=true&tile_size=m

https://app.datadoghq.com/dashboard/qw5-aw8-ah7/cpu-usage-across-all-servers?from_ts=1582560192763&to_ts=1582732992763&live=true&tile_size=s

https://app.datadoghq.com/dashboard/g9p-cpa-fd8/faye?from_ts=1582560191265&to_ts=1582732991265&live=true&tile_size=m

https://app.datadoghq.com/dashboard/whz-6ej-pd7/everything-mongodb?from_ts=1582560354870&to_ts=1582733154870&live=true&tile_size=m

Remediation
Restarted all of the webapp and ws services, just like we did in https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/8664#mitigated and https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/8821, and things are responding and working again in production and staging:
$ ssh gitter-beta-01.beta.gitter
$ sudo su
$ cd /opt/gitter-infrastructure/ansible/
$ ansible-playbook -i prod playbooks/gitter/restart-services.yml -t nonstaging --diff
...
# All "ok"
$ ansible-playbook -i prod playbooks/gitter/restart-services.yml -t staging --diff
...
# All "ok"
And we stopped seeing 500s on the gitter-elb-new load balancer in AWS.
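One way to double-check that the 500s really stopped is to tally the ELB access logs by status code. A quick sketch (the log lines below are fabricated; field 8 is the ELB status code in the classic ELB access-log format, so adjust the field index if our log format differs):

```shell
#!/bin/sh
# Fabricated classic-ELB access log lines (the real ones live in whatever
# S3 bucket gitter-elb-new logs to); field 8 is the ELB status code.
cat <<'EOF' > /tmp/elb.sample.log
2020-02-26T15:10:00.1Z gitter-elb-new 203.0.113.5:54321 10.0.11.30:443 0.00004 0.0012 0.00005 200 200 0 29 "GET https://gitter.im:443/ HTTP/1.1" "-" - -
2020-02-26T15:10:01.2Z gitter-elb-new 203.0.113.6:54322 10.0.11.31:443 0.00004 0.0015 0.00005 500 500 0 29 "GET https://gitter.im:443/api HTTP/1.1" "-" - -
2020-02-26T15:10:02.3Z gitter-elb-new 203.0.113.7:54323 10.0.11.30:443 0.00004 0.0011 0.00005 200 200 0 29 "GET https://gitter.im:443/ HTTP/1.1" "-" - -
2020-02-26T15:10:03.4Z gitter-elb-new 203.0.113.8:54324 10.0.11.31:443 0.00004 0.0019 0.00005 500 500 0 29 "GET https://gitter.im:443/api HTTP/1.1" "-" - -
2020-02-26T15:10:04.5Z gitter-elb-new 203.0.113.9:54325 10.0.11.30:443 0.00004 0.0010 0.00005 200 200 0 29 "GET https://gitter.im:443/ HTTP/1.1" "-" - -
EOF

# Tally requests per ELB status code; after the restart the 5xx count
# should drop to zero in fresh log chunks.
awk '{codes[$8]++} END {for (c in codes) print c, codes[c]}' /tmp/elb.sample.log | sort
```

Running this over log chunks from before and after the service restart makes the recovery visible without waiting on the Datadog dashboards.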
Follow-up
Fix the Redis latency problems: https://gitlab.com/gitlab-org/gitter/webapp/issues/2448. This is our current OKR for FY21-Q1, so we plan to address it by 2020-4-30 (hopefully sooner).
We should also make sure our Redis client library is able to reconnect properly after Redis Sentinel juggles things around: https://gitlab.com/gitlab-org/gitter/webapp/issues/2407

