2019-01-07 Read-Only Mount - Incident

Summary

On January 7th, 2019 at around 14:18 UTC, 33 production servers had their in-memory data synced to disk and their filesystems remounted in RO (read-only) mode, causing a short production outage across several services. The sync-and-remount operation was being validated as part of a maintenance activity for the staging environment; unfortunately, an erroneous command ended up touching production servers as well. Rebooting the affected servers mounted the disks back in RW (read-write) mode, which remedied the immediate outage. This was a manual user error, and as part of the RCA we will put safeguards in place to prevent similar issues from happening again.
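The root cause was a host-targeting mistake, so one preventive measure is to gate maintenance commands on the target's environment. The sketch below is a hypothetical illustration only: the `-staging-` naming pattern and the function names are assumptions, not the actual hostname scheme or tooling referenced in this incident.

```shell
#!/bin/sh
# Hypothetical guard: refuse to act unless the target hostname matches
# an assumed "-staging-" naming convention. Anything that does not look
# like a staging host is treated as production and rejected.
is_staging_host() {
  case "$1" in
    *-staging-*) return 0 ;;  # looks like a staging host
    *)           return 1 ;;  # fail closed: assume production
  esac
}

run_maintenance() {
  host="$1"
  if is_staging_host "$host"; then
    echo "ok: would run sync-and-remount against $host"
  else
    echo "refused: $host does not look like a staging host" >&2
    return 1
  fi
}
```

In practice the check would query the inventory system rather than parse hostnames, but failing closed on anything that does not look like staging would have stopped a command like this at the first production host.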

Service(s) affected :

  • pages
  • patroni, postgres
  • prometheus
  • pubsub
  • package

Team attribution : Production/SRE

Minutes downtime or degradation : 20 minutes (14:18 - 14:38 UTC)

Timeline

2019-01-07 While validating maintenance commands against a set of staging hosts, an erroneous command was run that touched not only the intended test hosts but also 33 production hosts running services such as pages, patroni, postgres, prometheus, pubsub and package.

  • 14:16 UTC - Postgres primary stopped because of a read-only filesystem
  • 14:18 UTC - Alerts started coming in and paged on-call
  • 14:20 UTC - Tweeted and let users know we are investigating the issue
  • 14:21 UTC - Incident channel opened and zoom started
  • 14:22 UTC - Detected that no postgres process was running on the patroni nodes, and observed that the filesystems were read-only
  • 14:31 UTC - Postgres instances restarted; postgres successfully performed recovery
  • 14:33 UTC - Tweeted and let users know we identified the issue and are working on a fix
  • 14:38 UTC - Rebooted the remaining affected hosts
  • 14:42 UTC - Tweeted and let users know we repaired the DB backend and are monitoring the health of the system
  • 14:44 UTC - Tweeted and let users know Pages would be in degraded performance
  • 14:47 UTC - Package server found to be affected by the same issue
  • 14:55 UTC - Tweeted that .com and Pages are operational and that we were monitoring
  • 15:01 UTC - Rebooted package server
  • 15:07 UTC - Tweeted saying all services are operational
  • 15:31 UTC - Dead tuple alerts fired because of missing statistics; @abrandl kicked off ANALYZE, which completed at 16:18 UTC; stats are up to date

Notes - there does not appear to be any data loss related to the postgres nodes. Rebooting the nodes brought the filesystems back to a healthy state, and no restore procedures were required.
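The symptom on each affected host was a filesystem mounted read-only, so a quick fleet-wide check for lingering degraded hosts is to parse `/proc/mounts` for the `ro` flag. A minimal sketch, assuming standard Linux paths (the optional file argument exists only to make the function testable):

```shell
#!/bin/sh
# List mount points whose mount options include the "ro" flag, i.e.
# filesystems currently mounted read-only. Field 2 of /proc/mounts is
# the mount point, field 4 the comma-separated options.
ro_mounts() {
  awk '$4 ~ /(^|,)ro(,|$)/ { print $2 }' "${1:-/proc/mounts}"
}
```

Run under the fleet-execution tool of choice, any host that prints a data mount point here is still in the degraded state and needs a remount or reboot.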

Edited Jan 07, 2019 by Andreas Brandl