# July 15th outage postmortem
## Context
An Azure outage took down the majority of our NFS servers, bringing the whole service down with them for over 7 hours.
## Timeline
Date: 2017-07-15
- 03:45 UTC - We're paged about some NFS servers being down (7 out of 12)
- 03:55 UTC - We enable the deploy page
- 04:00 UTC - Seeing no clear reason for the outage, we create a ticket with critical severity on the Azure portal
- 04:10 UTC - We get a response from Azure support acknowledging the existence of an issue with several Linux VMs
- 04:26 UTC - While waiting for an ETA for a restore of service from Azure support, we try to redeploy one of the NFS nodes through the Azure portal, but it fails
- 04:55 UTC - We change our deploy page to indicate the outage and to link to our status Twitter feed
- 05:30-09:15 UTC - We tweet updates as we receive them from Azure
- 08:40 UTC - Almost all of our API nodes are filling their disks with a constant stream of Unicorn errors and backtraces (over 20 GB of logs); we clear those logs
- 10:30 UTC - We receive alerts that the NFS nodes are back up
- 10:45 UTC - We verify that the outage didn't affect data integrity; everything looks good (see the fsck sketch after the timeline)
- 11:15 UTC - We bounce some nodes that didn't respond properly to the NFS mounts
- 11:30 UTC - We disable the deploy page, bringing the service back up
- 11:45 UTC - We notice that some projects show a warning about a non-existent repository, a symptom of cache poisoning; we clear the cache (see the cache-clearing sketch after the timeline) and the projects return to normal
- 11:55 UTC - Some users are getting 500 errors; we suspect we're running out of database connections, so we restart PgBouncer and the errors stop
- 13:11 UTC - File system corruption detected on nfs-08; document with recovery steps: https://docs.google.com/document/d/14KUrDQi8gDYx-JIAyFlf1SXgabramLas4wgDxEeHmDo/edit#heading=h.90osf4nff4m0
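The 10:45 integrity check boils down to confirming that repositories on the recovered NFS mounts are still readable and consistent. A minimal sketch of such a check, assuming bare repositories live under `/var/opt/gitlab/git-data/repositories` (the path and the script are illustrative, not the exact procedure we ran):

```python
# Illustrative integrity check: run `git fsck` against every bare repository
# under an assumed storage path and report the ones that fail.
import subprocess
from pathlib import Path

REPO_ROOT = Path("/var/opt/gitlab/git-data/repositories")  # assumed storage path

def check_repositories(root):
    """Return the repositories where `git fsck` reported problems."""
    broken = []
    for repo in sorted(root.glob("**/*.git")):
        result = subprocess.run(
            ["git", "--git-dir", str(repo), "fsck"],
            capture_output=True,
            text=True,
        )
        if result.returncode != 0:
            broken.append(repo)
            print(f"fsck failed: {repo}\n{result.stderr.strip()}")
    return broken

if __name__ == "__main__":
    bad = check_repositories(REPO_ROOT)
    print(f"{len(bad)} repositories need attention")
```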
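The 11:45 cache clear is roughly what GitLab's `gitlab-rake cache:clear` task does. A sketch of the same idea, assuming the Rails cache lives in a local Redis under keys prefixed `cache:gitlab:` (host, port, and key pattern are assumptions):

```python
# Illustrative cache flush: delete Rails cache keys from Redis in batches.
import redis

r = redis.Redis(host="localhost", port=6379)  # assumed Redis location

def clear_rails_cache(pattern="cache:gitlab:*", batch=1000):
    """Delete cache keys matching `pattern`; return how many were removed."""
    deleted = 0
    pending = []
    for key in r.scan_iter(match=pattern, count=batch):
        pending.append(key)
        if len(pending) >= batch:
            deleted += r.delete(*pending)
            pending.clear()
    if pending:
        deleted += r.delete(*pending)
    return deleted

if __name__ == "__main__":
    print(f"removed {clear_rails_cache()} cache keys")
```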
## Corrective actions
- Create an alert for Postgres connection usage (a sketch of the underlying check follows) - https://gitlab.com/gitlab-com/infrastructure/issues/2299
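A rough sketch of the check such an alert would encode: compare the number of sessions in `pg_stat_activity` against `max_connections` and fire above a threshold. The DSN and the 90% threshold below are placeholders, not our production values.

```python
# Illustrative connection-usage check against Postgres.
import psycopg2

THRESHOLD = 0.9  # placeholder: alert when 90% of max_connections are in use

def connection_usage(dsn="dbname=gitlabhq_production"):
    """Return the fraction of max_connections currently in use."""
    with psycopg2.connect(dsn) as conn:
        with conn.cursor() as cur:
            cur.execute("SELECT count(*) FROM pg_stat_activity")
            used = cur.fetchone()[0]
            cur.execute("SHOW max_connections")
            limit = int(cur.fetchone()[0])
    return used / limit

if __name__ == "__main__":
    usage = connection_usage()
    status = "ALERT" if usage >= THRESHOLD else "ok"
    print(f"{status}: {usage:.0%} of Postgres connections in use")
```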