Gitter outage 2020-2-26 - Redis Sentinel going awry
Gitter was down for about 2 hours on 2020-2-26.
This outage is similar to https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/8821 and https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/8664, but this time we have a better grasp on why it's happening and have a plan to properly fix it (see the follow-up section below).
Investigation
- Redis Sentinel went down and the leader-vote churn started around 8:34AM:
  - [8:34AM (GMT-6)] sentinel-03: +sdown master gitter-master 10.0.11.124 6379, https://gitter.pagerduty.com/incidents/PH869RY
  - [8:34AM (GMT-6)] sentinel-02: +vote-for-leader, https://gitter.pagerduty.com/incidents/P8VP56S
  - [8:34AM (GMT-6)] sentinel-02: +sdown master gitter-master 10.0.11.124 6379, https://gitter.pagerduty.com/incidents/PK5M41J
  - [8:34AM (GMT-6)] sentinel-02: +try-failover master gitter-master 10.0.11.124 6379, https://gitter.pagerduty.com/incidents/P6TGA09
  - [8:34AM (GMT-6)] etc, ...
- [8:52AM (GMT-6)] ws-xx WebSocket servers become unavailable for our SSL check
- [9:15AM (GMT-6)] More problems on the ws-xx WebSocket servers: they were not able to upload their CPU/memory profiling data to AWS, e.g. on ws-07:
  - Profiling CPU usage (/tmp/tmp.ShXDGJ5z70.deploy-tools-profile, kernel.kptr_restrict = 0, kernel.perf_event_paranoid = 0): profile-service-cpu failed
  - Profiling memory usage: S3 upload failed for /var/log/gitter/heap.ws-07.gitter-websockets-1.29692.2020-02-26-151516.heapsnapshot
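The Sentinel event sequence above can be pulled straight out of the Sentinel logs on the sentinel-0x boxes. A rough sketch (the sample log lines below are fabricated to mirror the events above; on a real box you would point this at the actual Redis Sentinel log file instead):

```shell
#!/bin/sh
# Fabricated Sentinel log lines mirroring the 8:34AM event sequence above.
# On a real sentinel-0x box this would be the actual Sentinel log file.
cat <<'EOF' > /tmp/sentinel.sample.log
14:34:02.110 # +sdown master gitter-master 10.0.11.124 6379
14:34:02.171 # +vote-for-leader 3c1a2b9d8e7f 117
14:34:02.233 # +try-failover master gitter-master 10.0.11.124 6379
14:34:12.410 # -failover-abort-not-elected master gitter-master 10.0.11.124 6379
14:34:13.001 * +reboot slave 10.0.11.7:6379 10.0.11.7 6379
EOF

# Keep only the events that tell the failover story.
grep -E '\+sdown|\+vote-for-leader|\+try-failover|failover-abort' /tmp/sentinel.sample.log
```

This prints the four failover-related events and drops the noise, which is roughly how the timeline above was assembled.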
Looking at the Redis entries in our Kibana logging (ssh -L 5601:localhost:5601 deployer@logging-01.prod.gitter -> http://localhost:5601/app/kibana):
- Lots of `Redis sentinel client: sentinel message: -failover-abort-not-elected` when the outage started
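For context (my reading of the Redis Sentinel docs, so treat it as an assumption about our exact setup): `-failover-abort-not-elected` means a Sentinel saw the master as down and tried to fail over, but lost the leader election. Reaching the monitor quorum is enough to *start* a failover attempt, but actually running one requires a majority of all known Sentinels to vote for the same leader, so a partially-down Sentinel fleet can keep trying and aborting. An illustrative sentinel.conf fragment (values are made up, not our real config):

```conf
# sentinel.conf (illustrative values only, not Gitter's actual config)

# quorum = 2: two Sentinels must agree the master is subjectively down
# (+sdown) before it is marked objectively down and a failover can start.
sentinel monitor gitter-master 10.0.11.124 6379 2

# How long the master must be unreachable before a Sentinel declares +sdown.
sentinel down-after-milliseconds gitter-master 5000

# Give up on a failover attempt that makes no progress within this window.
sentinel failover-timeout gitter-master 60000
```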
Metrics/dashboards
Nothing stands out to me on our dashboards: https://app.datadoghq.com/dashboard/lists. I see the drop in traffic and load when the servers stopped responding.
Screenshots
https://app.datadoghq.com/dashboard/jz3-a5e-mye/redis---overview?from_ts=1582560372393&live=true&tile_size=m&to_ts=1582733172393

https://app.datadoghq.com/dashboard/spf-vjw-ujs/websocket-servers?from_ts=1582560383780&to_ts=1582733183780&live=true&tile_size=m

https://app.datadoghq.com/dashboard/qw5-aw8-ah7/cpu-usage-across-all-servers?from_ts=1582560192763&to_ts=1582732992763&live=true&tile_size=s

https://app.datadoghq.com/dashboard/g9p-cpa-fd8/faye?from_ts=1582560191265&to_ts=1582732991265&live=true&tile_size=m

https://app.datadoghq.com/dashboard/whz-6ej-pd7/everything-mongodb?from_ts=1582560354870&to_ts=1582733154870&live=true&tile_size=m

Remediation
Restarted all of the webapp and ws services, just like we did in https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/8664#mitigated and https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/8821, and things are responding and working again in production and staging:
$ ssh gitter-beta-01.beta.gitter
$ sudo su
$ cd /opt/gitter-infrastructure/ansible/
$ ansible-playbook -i prod playbooks/gitter/restart-services.yml -t nonstaging --diff
...
# All "ok"
$ ansible-playbook -i prod playbooks/gitter/restart-services.yml -t staging --diff
...
# All "ok"
And we stopped seeing 500s on the gitter-elb-new load balancer in AWS.
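One way to double-check that the 500s really stopped is to tally the ELB access logs by status code. A quick sketch (the log lines below are fabricated; field 8 is the ELB status code in the classic ELB access-log format, so adjust the field index if our log format differs):

```shell
#!/bin/sh
# Fabricated classic-ELB access log lines (the real ones live in whatever
# S3 bucket gitter-elb-new logs to); field 8 is the ELB status code.
cat <<'EOF' > /tmp/elb.sample.log
2020-02-26T15:10:00.1Z gitter-elb-new 203.0.113.5:54321 10.0.11.30:443 0.00004 0.0012 0.00005 200 200 0 29 "GET https://gitter.im:443/ HTTP/1.1" "-" - -
2020-02-26T15:10:01.2Z gitter-elb-new 203.0.113.6:54322 10.0.11.31:443 0.00004 0.0015 0.00005 500 500 0 29 "GET https://gitter.im:443/api HTTP/1.1" "-" - -
2020-02-26T15:10:02.3Z gitter-elb-new 203.0.113.7:54323 10.0.11.30:443 0.00004 0.0011 0.00005 200 200 0 29 "GET https://gitter.im:443/ HTTP/1.1" "-" - -
2020-02-26T15:10:03.4Z gitter-elb-new 203.0.113.8:54324 10.0.11.31:443 0.00004 0.0019 0.00005 500 500 0 29 "GET https://gitter.im:443/api HTTP/1.1" "-" - -
2020-02-26T15:10:04.5Z gitter-elb-new 203.0.113.9:54325 10.0.11.30:443 0.00004 0.0010 0.00005 200 200 0 29 "GET https://gitter.im:443/ HTTP/1.1" "-" - -
EOF

# Tally requests per ELB status code; after the restart the 5xx count
# should drop to zero in fresh log chunks.
awk '{codes[$8]++} END {for (c in codes) print c, codes[c]}' /tmp/elb.sample.log | sort
```

Running this over log chunks from before and after the service restart makes the recovery visible without waiting on the Datadog dashboards.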
Follow-up
Fix the Redis latency problems: https://gitlab.com/gitlab-org/gitter/webapp/issues/2448. This is our current OKR for FY21-Q1, so we plan to address it by 2020-4-30 (hopefully sooner).
We should also make sure our Redis client library is able to reconnect properly after Redis Sentinel juggles things around: https://gitlab.com/gitlab-org/gitter/webapp/issues/2407

