2019-01-07 Read-Only Mount - Incident

Summary

On January 7th, 2019 at around 14:18 UTC, 33 production servers had their in-memory data synced to disk and their filesystems remounted in RO (read-only) mode, causing a short production outage across several services. The sync-and-remount operation was being validated as part of a maintenance activity for the staging environment; unfortunately, an erroneous command ended up touching production servers as well. Rebooting the affected servers mounted the disks back in RW (read-write) mode, which remedied the immediate outage. This was a manual user error, and as part of the RCA we will put safeguards in place to prevent similar issues from happening again.
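The root cause was a host-targeting mistake, so one preventive measure is to gate maintenance commands on the target's environment. The sketch below is a hypothetical illustration only: the `-staging-` naming pattern and the function names are assumptions, not the actual hostname scheme or tooling referenced in this incident.

```shell
#!/bin/sh
# Hypothetical guard: refuse to act unless the target hostname matches
# an assumed "-staging-" naming convention. Anything that does not look
# like a staging host is treated as production and rejected.
is_staging_host() {
  case "$1" in
    *-staging-*) return 0 ;;  # looks like a staging host
    *)           return 1 ;;  # fail closed: assume production
  esac
}

run_maintenance() {
  host="$1"
  if is_staging_host "$host"; then
    echo "ok: would run sync-and-remount against $host"
  else
    echo "refused: $host does not look like a staging host" >&2
    return 1
  fi
}
```

In practice the check would query the inventory system rather than parse hostnames, but failing closed on anything that does not look like staging would have stopped a command like this at the first production host.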

Service(s) affected :

  • pages
  • patroni, postgres
  • prometheus
  • pubsub
  • package

Team attribution : Production/SRE

Minutes downtime or degradation : 20 minutes (14:18 - 14:38 UTC)

Timeline

2019-01-07 While validating maintenance commands against a set of staging hosts, an erroneous command was run that touched not only the intended test hosts but also 33 production hosts running services such as pages, patroni, postgres, prometheus, pubsub and package.

  • 14:16 UTC - Postgres primary stopped because of a read-only filesystem
  • 14:18 UTC - Alerts started coming in and paged on-call
  • 14:20 UTC - Tweeted and let users know we are investigating the issue
  • 14:21 UTC - Incident channel opened and zoom started
  • 14:22 UTC - Detected that no postgres process was running on the patroni nodes, and observed that the filesystems were read-only
  • 14:31 UTC - Postgres instances restarted; postgres successfully performed recovery
  • 14:33 UTC - Tweeted and let users know we identified the issue and are working on a fix
  • 14:38 UTC - Rebooted the remaining affected hosts
  • 14:42 UTC - Tweeted and let users know we repaired the DB backend and are monitoring the health of the system
  • 14:44 UTC - Tweeted and let users know Pages would be in degraded performance
  • 14:47 UTC - Package server found to be affected by the same issue
  • 14:55 UTC - Tweeted that .com and Pages are operational and that we were monitoring
  • 15:01 UTC - Rebooted package server
  • 15:07 UTC - Tweeted saying all services are operational
  • 15:31 UTC - Dead tuple alerts fired because of missing statistics; @abrandl kicked off ANALYZE, which completed at 16:18 UTC; stats are up to date

Notes - there does not appear to be any data loss related to the postgres nodes. Rebooting the nodes brought the filesystems back to a healthy state, and no restore procedures were required.
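The symptom on each affected host was a filesystem mounted read-only, so a quick fleet-wide check for lingering degraded hosts is to parse `/proc/mounts` for the `ro` flag. A minimal sketch, assuming standard Linux paths (the optional file argument exists only to make the function testable):

```shell
#!/bin/sh
# List mount points whose mount options include the "ro" flag, i.e.
# filesystems currently mounted read-only. Field 2 of /proc/mounts is
# the mount point, field 4 the comma-separated options.
ro_mounts() {
  awk '$4 ~ /(^|,)ro(,|$)/ { print $2 }' "${1:-/proc/mounts}"
}
```

Run under the fleet-execution tool of choice, any host that prints a data mount point here is still in the degraded state and needs a remount or reboot.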

Edited Jan 07, 2019 by Andreas Brandl