# July 15th outage postmortem
## Context
An Azure outage took down the majority of our NFS servers, bringing the whole service down with them for over 7 hours.
## Timeline
Date: 2017-07-15
- 03:45 UTC - We're paged about some NFS servers being down (7 out of 12)
- 03:55 UTC - We enable the deploy page
- 04:00 UTC - Seeing no clear reason for the outage, we create a ticket with critical severity on the Azure portal
- 04:10 UTC - We get a response from Azure support acknowledging the existence of an issue with several Linux VMs
- 04:26 UTC - While waiting for an ETA for a restore of service from Azure support, we try to redeploy one of the NFS nodes through the Azure portal, but it fails
- 04:55 UTC - We change our deploy page to indicate the outage and to link to our status Twitter feed
- 05:30-09:15 UTC - We tweet updates as we receive them from Azure
- 08:40 UTC - Almost all of our API nodes are filling their disks with a constant stream of Unicorn errors and backtraces (over 20 GB of logs); we clear those logs
- 10:30 UTC - We receive alerts that the NFS nodes are back up
- 10:45 UTC - We verify that the outage didn't affect data integrity; everything looks good (see the fsck sketch after the timeline)
- 11:15 UTC - We bounce some nodes that didn't respond properly to the NFS mounts
- 11:30 UTC - We disable the deploy page, bringing the service back up
- 11:45 UTC - We notice that some projects show a warning about a non-existent repository, a symptom of cache poisoning; we clear the cache (see the cache-clearing sketch after the timeline) and the projects return to normal
- 11:55 UTC - Some users are getting 500 errors; we suspect we're running out of database connections, so we restart PgBouncer and the errors stop
- 13:11 UTC - File system corruption detected on nfs-08; document with recovery steps: https://docs.google.com/document/d/14KUrDQi8gDYx-JIAyFlf1SXgabramLas4wgDxEeHmDo/edit#heading=h.90osf4nff4m0
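The 10:45 integrity check boils down to confirming that repositories on the recovered NFS mounts are still readable and consistent. A minimal sketch of such a check, assuming bare repositories live under `/var/opt/gitlab/git-data/repositories` (the path and the script are illustrative, not the exact procedure we ran):

```python
# Illustrative integrity check: run `git fsck` against every bare repository
# under an assumed storage path and report the ones that fail.
import subprocess
from pathlib import Path

REPO_ROOT = Path("/var/opt/gitlab/git-data/repositories")  # assumed storage path

def check_repositories(root):
    """Return the repositories where `git fsck` reported problems."""
    broken = []
    for repo in sorted(root.glob("**/*.git")):
        result = subprocess.run(
            ["git", "--git-dir", str(repo), "fsck"],
            capture_output=True,
            text=True,
        )
        if result.returncode != 0:
            broken.append(repo)
            print(f"fsck failed: {repo}\n{result.stderr.strip()}")
    return broken

if __name__ == "__main__":
    bad = check_repositories(REPO_ROOT)
    print(f"{len(bad)} repositories need attention")
```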
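The 11:45 cache clear is roughly what GitLab's `gitlab-rake cache:clear` task does. A sketch of the same idea, assuming the Rails cache lives in a local Redis under keys prefixed `cache:gitlab:` (host, port, and key pattern are assumptions):

```python
# Illustrative cache flush: delete Rails cache keys from Redis in batches.
import redis

r = redis.Redis(host="localhost", port=6379)  # assumed Redis location

def clear_rails_cache(pattern="cache:gitlab:*", batch=1000):
    """Delete cache keys matching `pattern`; return how many were removed."""
    deleted = 0
    pending = []
    for key in r.scan_iter(match=pattern, count=batch):
        pending.append(key)
        if len(pending) >= batch:
            deleted += r.delete(*pending)
            pending.clear()
    if pending:
        deleted += r.delete(*pending)
    return deleted

if __name__ == "__main__":
    print(f"removed {clear_rails_cache()} cache keys")
```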
## Corrective actions
- Create an alert for Postgres connection usage (a sketch of the underlying check follows) - https://gitlab.com/gitlab-com/infrastructure/issues/2299
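A rough sketch of the check such an alert would encode: compare the number of sessions in `pg_stat_activity` against `max_connections` and fire above a threshold. The DSN and the 90% threshold below are placeholders, not our production values.

```python
# Illustrative connection-usage check against Postgres.
import psycopg2

THRESHOLD = 0.9  # placeholder: alert when 90% of max_connections are in use

def connection_usage(dsn="dbname=gitlabhq_production"):
    """Return the fraction of max_connections currently in use."""
    with psycopg2.connect(dsn) as conn:
        with conn.cursor() as cur:
            cur.execute("SELECT count(*) FROM pg_stat_activity")
            used = cur.fetchone()[0]
            cur.execute("SHOW max_connections")
            limit = int(cur.fetchone()[0])
    return used / limit

if __name__ == "__main__":
    usage = connection_usage()
    status = "ALERT" if usage >= THRESHOLD else "ok"
    print(f"{status}: {usage:.0%} of Postgres connections in use")
```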