
July 15th outage postmortem

Context

An Azure outage caused the majority of our NFS servers (7 out of 12) to go down, bringing the whole service down with them for over 7 hours.

Timeline

Date: 2017-07-15

  • 03:45 UTC - We're paged about some NFS servers being down (7 out of 12)
  • 03:55 UTC - We enable the deploy page
  • 04:00 UTC - Seeing no clear reason for the outage, we create a critical-severity ticket on the Azure portal
  • 04:10 UTC - We get a response from Azure support acknowledging the existence of an issue with several Linux VMs
  • 04:26 UTC - While waiting for an ETA for service restoration from Azure support, we try to redeploy one of the NFS nodes through the Azure portal, but it fails
  • 04:55 UTC - We change our deploy page to indicate the outage and to link to our status Twitter feed
  • 05:30~09:15 UTC - We tweet about any updates we get from Azure
  • 08:40 UTC - Almost all of our API nodes are filling their disks with a constant stream of Unicorn errors and backtraces (over 20 GB of logs); we clear those logs (see the log-truncation sketch after this timeline)
  • 10:30 UTC - We get alerting notifications about the NFS nodes being up
  • 10:45 UTC - We verify that the outage didn't affect data integrity; everything looks good
  • 11:15 UTC - We bounce some nodes whose NFS mounts didn't come back properly (see the mount-check sketch after this timeline)
  • 11:30 UTC - We disable the deploy page, bringing the service up
  • 11:45 UTC - We notice that some projects are showing a warning about a non-existent repository, a symptom of cache poisoning, so we clear the cache and the projects are back to normal (see the cache-clearing sketch after this timeline)
  • 11:55 UTC - Some users are getting 500 errors; we suspect we're running out of DB connections, so we bounce pgbouncer and the errors are gone (see the connection-pressure sketch after this timeline)
  • 13:11 UTC - nfs-08 file system corruption detected; doc with recovery steps: https://docs.google.com/document/d/14KUrDQi8gDYx-JIAyFlf1SXgabramLas4wgDxEeHmDo/edit#heading=h.90osf4nff4m0
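
The 08:40 log flood was cleaned up by hand. For reference, below is a minimal sketch of how that cleanup could be scripted; the log directory and the 1 GB per-file threshold are assumptions for illustration, not values from the incident.

```python
#!/usr/bin/env python3
"""Truncate oversized log files so an error flood cannot fill the disk.

Minimal sketch: the log directory and size threshold are assumptions,
not the values used during the incident.
"""
import os

LOG_DIR = "/var/log/unicorn"   # hypothetical log directory
MAX_BYTES = 1 * 1024 ** 3      # truncate anything larger than 1 GB


def truncate_oversized_logs(log_dir: str, max_bytes: int) -> None:
    """Truncate files above max_bytes in place."""
    for root, _dirs, files in os.walk(log_dir):
        for name in files:
            path = os.path.join(root, name)
            try:
                if os.path.getsize(path) > max_bytes:
                    # Truncating in place avoids breaking processes that keep
                    # the file open, unlike deleting and recreating it.
                    with open(path, "w"):
                        pass
                    print(f"truncated {path}")
            except OSError as exc:
                print(f"skipping {path}: {exc}")


if __name__ == "__main__":
    truncate_oversized_logs(LOG_DIR, MAX_BYTES)
```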
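
At 11:15 some nodes had to be bounced because their NFS mounts never recovered; a stale NFS mount typically makes any filesystem call on the mount point hang. The sketch below checks for that condition before deciding to bounce a node. The mount point paths and the timeout are assumptions, not configuration from our fleet.

```python
#!/usr/bin/env python3
"""Detect NFS mount points that hang instead of answering filesystem calls.

Minimal sketch: the mount point list and the 5 s timeout are assumptions.
"""
import subprocess

MOUNT_POINTS = ["/var/opt/gitlab/nfs-01", "/var/opt/gitlab/nfs-02"]  # hypothetical paths
TIMEOUT_SECONDS = 5


def is_stale(mount_point: str, timeout: int) -> bool:
    """Return True if a simple stat of the mount point fails or does not finish in time."""
    proc = subprocess.Popen(
        ["stat", "--", mount_point],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    try:
        return proc.wait(timeout=timeout) != 0
    except subprocess.TimeoutExpired:
        # The stat is stuck (likely uninterruptible NFS I/O); flag the mount
        # and move on. The child process may linger until the server returns.
        proc.kill()
        return True


if __name__ == "__main__":
    for mp in MOUNT_POINTS:
        if is_stale(mp, TIMEOUT_SECONDS):
            print(f"{mp}: stale or unreachable, node may need a bounce")
        else:
            print(f"{mp}: OK")
```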
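
The 11:45 "non-existent repository" warnings were fixed by clearing the cache. The sketch below illustrates the narrower idea of deleting only the poisoned entries, assuming (hypothetically) that they live in Redis under a key pattern like the one shown; the real cache layout and the actual command we ran may differ.

```python
#!/usr/bin/env python3
"""Delete cached repository-existence entries poisoned during the outage.

Sketch only: the Redis address and the key pattern are assumptions; the real
cache layout depends on the application.
"""
import redis  # pip install redis

CACHE_KEY_PATTERN = "cache:gitlab:*exists?*"  # hypothetical key pattern
BATCH_SIZE = 500


def clear_poisoned_cache(host: str = "localhost", port: int = 6379) -> int:
    """Scan for matching keys and delete them in batches; return the count deleted."""
    client = redis.Redis(host=host, port=port)
    deleted = 0
    batch = []
    for key in client.scan_iter(match=CACHE_KEY_PATTERN, count=BATCH_SIZE):
        batch.append(key)
        if len(batch) >= BATCH_SIZE:
            deleted += client.delete(*batch)
            batch.clear()
    if batch:
        deleted += client.delete(*batch)
    return deleted


if __name__ == "__main__":
    print(f"deleted {clear_poisoned_cache()} cache keys")
```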
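
Before bouncing pgbouncer at 11:55, a quick look at connection pressure on the database can confirm the "out of DB connections" suspicion. A minimal sketch, assuming direct access to PostgreSQL with a placeholder DSN and a 90% warning threshold chosen for illustration:

```python
#!/usr/bin/env python3
"""Report how close PostgreSQL is to its connection limit.

Sketch only: the DSN is a placeholder and the warning threshold is an assumption.
"""
import psycopg2  # pip install psycopg2-binary

DSN = "host=localhost dbname=gitlabhq_production user=gitlab"  # placeholder DSN
WARN_RATIO = 0.9


def connection_pressure(dsn: str) -> tuple[int, int]:
    """Return (connections in use, max_connections) for the given database."""
    conn = psycopg2.connect(dsn)
    try:
        with conn.cursor() as cur:
            cur.execute("SELECT count(*) FROM pg_stat_activity")
            used = cur.fetchone()[0]
            cur.execute("SHOW max_connections")
            limit = int(cur.fetchone()[0])
    finally:
        conn.close()
    return used, limit


if __name__ == "__main__":
    used, limit = connection_pressure(DSN)
    print(f"{used}/{limit} connections in use")
    if used >= WARN_RATIO * limit:
        print("WARNING: close to max_connections; investigate or recycle connection pools")
```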

Corrective actions
