Gitter outage 2020-2-26 - Redis Sentinel going awry

Gitter outage on 2020-2-26 for about 2 hours

This outage is similar to https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/8821 and https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/8664, but now we have a better grasp of why it happens and a plan to fix it properly (see the follow-up section below).

Investigation


Looking at our Kibana logs for Redis (ssh -L 5601:localhost:5601 deployer@logging-01.prod.gitter, then open http://localhost:5601/app/kibana):

  • Lots of Redis sentinel client: sentinel message: -failover-abort-not-elected when the outage started

Screen_Shot_2020-02-26_at_11.03.08_AM
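For context on that message: a Sentinel emits -failover-abort-not-elected when it agreed the master was down but failed to win the leader election for the failover, which requires votes from a strict majority of all known Sentinels (not just the configured quorum). A minimal sketch of that majority arithmetic, with illustrative counts that are assumptions rather than our actual topology:

```shell
#!/bin/sh
# Sketch: Sentinel leader election needs a strict majority of ALL known
# sentinels, independent of the configured quorum. The counts below are
# illustrative, not our real topology.
total_sentinels=3
responsive=1   # e.g. the other sentinels are partitioned or overloaded
majority=$(( total_sentinels / 2 + 1 ))

if [ "$responsive" -ge "$majority" ]; then
  echo "a failover leader can be elected"
else
  echo "no leader possible -> -failover-abort-not-elected"
fi
```

So when most Sentinels are struggling at once, every failover attempt aborts with exactly the message we saw, over and over.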

Metrics/dashboards

Nothing stands out on our dashboards: https://app.datadoghq.com/dashboard/lists. I can see the drop in traffic and load when the servers stop responding.

Screenshots

https://app.datadoghq.com/dashboard/jz3-a5e-mye/redis---overview?from_ts=1582560372393&live=true&tile_size=m&to_ts=1582733172393

chrome_2020-02-26_10-30-24

https://app.datadoghq.com/dashboard/spf-vjw-ujs/websocket-servers?from_ts=1582560383780&to_ts=1582733183780&live=true&tile_size=m

https://app.datadoghq.com/dashboard/qw5-aw8-ah7/cpu-usage-across-all-servers?from_ts=1582560192763&to_ts=1582732992763&live=true&tile_size=s

chrome_2020-02-26_10-27-17

https://app.datadoghq.com/dashboard/g9p-cpa-fd8/faye?from_ts=1582560191265&to_ts=1582732991265&live=true&tile_size=m

chrome_2020-02-26_10-28-01

https://app.datadoghq.com/dashboard/whz-6ej-pd7/everything-mongodb?from_ts=1582560354870&to_ts=1582733154870&live=true&tile_size=m

chrome_2020-02-26_10-29-33

Remediation

Restarted all of the webapp and ws services, just as we did in https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/8664#mitigated and https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/8821, and things are responding and working again in production and staging.

$ ssh gitter-beta-01.beta.gitter

$ sudo su
$ cd /opt/gitter-infrastructure/ansible/

# Restart the production (non-staging) services
$ ansible-playbook -i prod playbooks/gitter/restart-services.yml -t nonstaging --diff
...
# All "ok"


# Restart the staging services
$ ansible-playbook -i prod playbooks/gitter/restart-services.yml -t staging --diff
...
# All "ok"

And we stopped seeing 500s on the gitter-elb-new load balancer in AWS.

Screen_Shot_2020-02-26_at_10.33.28_AM

Follow-up

Fix the Redis latency problems: https://gitlab.com/gitlab-org/gitter/webapp/issues/2448. This is our current OKR for FY21-Q1, so we plan to address it by 2020-04-30 (hopefully sooner).

We should also make sure our Redis client library reconnects properly after Redis Sentinel shuffles the topology: https://gitlab.com/gitlab-org/gitter/webapp/issues/2407
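The behaviour we want from the client is roughly: keep retrying the connection with backoff until the newly promoted master answers, instead of wedging on the dead connection (which is what forces us to restart services by hand). A rough shell sketch of that retry loop; connect_once here is a hypothetical stand-in that simulates the failover window, not a real client call:

```shell
#!/bin/sh
# Sketch of the desired client behaviour: retry with backoff until the
# new master is reachable. connect_once is a hypothetical stand-in that
# pretends the connection starts succeeding once failover completes
# (here, on attempt 3).
connect_once() {
  [ "$1" -ge 3 ]
}

attempt=0
max_attempts=5
until connect_once "$attempt"; do
  attempt=$(( attempt + 1 ))
  if [ "$attempt" -gt "$max_attempts" ]; then
    echo "gave up after $max_attempts attempts"
    exit 1
  fi
  sleep 0   # a real client would back off here, e.g. sleep $(( attempt ))
done
echo "reconnected on attempt $attempt"
```

If the client library did this on its own after a Sentinel failover, the manual service restarts above would be unnecessary.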

Edited by Eric Eastwood