Gitter Down (Redis backup fail -> DDOS) -- 2018-05-01
Summary
A Redis issue likely caused a large number of websocket clients to get kicked off and attempt to reconnect en masse, leading to a DDOS of the Gitter servers. The increased load led to some instances being marked as unhealthy, which in turn led the AutoscalingGroup to terminate them in anticipation of provisioning new, healthy instances. Unfortunately, the provisioning scripts had not been tested in 9+ months and failed for several reasons (including Python package changes and the nfs-file-16 outage). The remaining hosts had to absorb more load, leading to more failures, until the entire worker fleet had been terminated by the psychotic ASG.
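To make the feedback loop concrete, here is a rough sketch of the cascade described above. Only the fleet size of 16 comes from this incident; the load and capacity figures, the one-termination-per-round rule, and the function name are hypothetical simplifications, not measurements or the real ASG behaviour.

```python
def simulate_cascade(hosts, total_load, capacity_per_host,
                     provisioning_works, max_rounds=50):
    """Return the fleet size after each health-check round.

    Simplified model (assumption): whenever the per-host load exceeds
    capacity, one host fails its health check and is terminated. A
    replacement only appears if provisioning works.
    """
    history = [hosts]
    for _ in range(max_rounds):
        if hosts == 0 or total_load / hosts <= capacity_per_host:
            break  # fleet is gone, or load is back within capacity
        hosts -= 1  # unhealthy host terminated by the autoscaling group
        if provisioning_works:
            hosts += 1  # replacement instance comes up
        history.append(hosts)
    return history


# Hypothetical numbers: 16 hosts, each able to handle 100 units of load,
# hit by a reconnect storm of 1700 units.
broken = simulate_cascade(16, 1700, 100, provisioning_works=False)
working = simulate_cascade(16, 1700, 100, provisioning_works=True)
```

With broken provisioning, each termination pushes more load onto the survivors, so `broken` counts down from 16 to 0. With working provisioning, `working` stays at 16: the fleet is still overloaded in this toy model, but it is not destroyed.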
@MadLittleMods and @andrewn initially focused on fixing the provisioning scripts and then rebuilt the entire worker fleet (16 machines) from scratch. We then addressed the problem with Redis by adding a new volume with more provisioned IOPS (PIOPS). It’s looking much healthier now.
As a next step, we absolutely need to run the Packer script and provisioning in a CI job, so that we know when they fail. We’ll also focus on improving the notifications around failed provisioning.
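As a minimal sketch of what such a CI check could look like, the helper below runs the provisioning steps in order and reports the first one that fails. The template name `worker.json`, the step list, and the function name are assumptions for illustration only, not the actual Gitter setup.

```python
import subprocess

# Hypothetical steps: the template file name is an assumption.
PROVISIONING_STEPS = [
    ["packer", "validate", "worker.json"],
    ["packer", "build", "worker.json"],
]


def first_failing_step(steps, run=subprocess.run):
    """Run each step in order; return the first failing step, or None.

    `run` is injectable so the check itself can be tested without
    invoking the real tools.
    """
    for step in steps:
        result = run(step)
        if result.returncode != 0:
            return step
    return None
```

A scheduled CI job could call `first_failing_step(PROVISIONING_STEPS)` and exit non-zero (and send a notification naming the step) whenever the result is not `None`, so that a broken provisioning path is noticed before the next outage rather than during it.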