Post-mortem on missing NFS mounts

Original issue and timeline

05-18 19:24 UTC Per https://gitlab.com/gitlab-com/infrastructure/issues/1828, @brian noticed that some workers were (again) not properly mounting the LFS drives, which caused LFS objects to be misplaced and thus perceived data loss. (Perceived, since the data was not lost and was readily recovered.) For the purposes of this particular post-mortem, I'm going to pick the original reversion of the timeo value as the starting point.
- Brian recommended resetting the timeo value for NFS mounting on the workers
05-19 03:39 UTC @stanhu realized we did not cycle all the workers.
05-19 03:52 UTC @jtevnan proceeded to adjust the cookbooks (gitlab-cookbooks/gitlab-nfs-cluster!32, https://gitlab.com/gitlab-com/infrastructure/issues/1831), which should prevent this from recurring. This was written by Jason and merged by John, all within about 2 hours.
05-19 07:14 UTC @northrup reported that the git workers had been rebooted and that sidekiq was next.
05-19 14:35 UTC Following some conversation on chat, and a report by @dblessing that "something strange is still happening", Stan asked who was continuing the work; he and @omame had a call, and Daniele continued the reboot work (link to record of which hosts?)
05-19 15:32 UTC Brian started checking which hosts actually have the LFS share mounted https://gitlab.slack.com/archives/C101F3796/p1495207948348488
05-19 16:32 UTC Stan asked Ernst to see what was holding up progress. Ernst read the issue and pinged available team members; @ahanselka volunteered to determine status and continue the work.
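
The 15:32 step (checking which hosts actually have the LFS share mounted) could be scripted per host. The following is a minimal sketch, not the check that was actually run: it assumes Linux workers, reads the mount table from /proc/mounts, and uses a made-up mount point (/mnt/lfs), since the real path is not given in this issue.

```python
# Sketch: decide from /proc/mounts content whether a path is an NFS mount.
# MOUNT_POINT is a hypothetical path, not taken from the incident.
MOUNT_POINT = "/mnt/lfs"

NFS_TYPES = ("nfs", "nfs4")


def is_nfs_mounted(proc_mounts_text, mount_point):
    """Return True if mount_point is listed with an NFS filesystem type."""
    for line in proc_mounts_text.splitlines():
        fields = line.split()
        # Fields: device, mount point, filesystem type, options, dump, pass.
        if len(fields) < 3:
            continue
        if fields[1] == mount_point and fields[2] in NFS_TYPES:
            return True
    return False


def read_proc_mounts(path="/proc/mounts"):
    """Read the kernel's mount table (Linux only)."""
    with open(path) as handle:
        return handle.read()
```

On a worker, `is_nfs_mounted(read_proc_mounts(), MOUNT_POINT)` returning False would flag the host as one that still needs attention. Taking the mount table as a string keeps the check easy to test without a real NFS server.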

What went well

The root-cause hypothesis of an incorrect timeo setting was identified X?? hours after the first reports, and the hypothesis turned out to be correct.
Issue was made, other team members picked up on the queue, and
- a preventative fix for future re-occurrence was rolled out relatively quickly
- work to reboot all hosts was started
Unscheduled work was fully resolved within ~24 hours of the issue being filed.

What can be improved

How did the timeo variable get reverted to 1 second? How could that reversion have been caught or prevented?
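
One way such a reversion might be caught is a periodic check of the timeo option on the live mount. The sketch below is an illustration under assumptions, not an existing GitLab check: the mount point /mnt/lfs and the threshold MIN_TIMEO are hypothetical. Note that the NFS timeo option is expressed in tenths of a second, so a 1 second timeout appears as timeo=10, and 600 is the Linux default for NFS over TCP.

```python
# Sketch: flag an NFS mount whose timeo is below an expected minimum.
# MOUNT_POINT and MIN_TIMEO are hypothetical values chosen for illustration.
MOUNT_POINT = "/mnt/lfs"
MIN_TIMEO = 600  # deciseconds; assumed threshold, not the incident's value


def nfs_timeo(proc_mounts_text, mount_point):
    """Return timeo (in deciseconds) for an NFS mount, or None if not found."""
    for line in proc_mounts_text.splitlines():
        fields = line.split()
        # Fields: device, mount point, filesystem type, options, dump, pass.
        if len(fields) < 4 or fields[1] != mount_point:
            continue
        if fields[2] not in ("nfs", "nfs4"):
            continue
        for option in fields[3].split(","):
            key, _, value = option.partition("=")
            if key == "timeo" and value.isdigit():
                return int(value)
    return None


def timeo_problem(proc_mounts_text, mount_point, min_timeo):
    """Return a description of the problem, or None if the mount looks fine."""
    timeo = nfs_timeo(proc_mounts_text, mount_point)
    if timeo is None:
        return f"{mount_point}: no NFS mount with a timeo option found"
    if timeo < min_timeo:
        return f"{mount_point}: timeo={timeo} is below the expected {min_timeo}"
    return None
```

Feeding it the contents of /proc/mounts from each worker would report both a missing mount and a timeo that has slipped back to a low value.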
The hand-off of tasks from one team member to the next as we cycled through timezones and individuals' availabilities should be improved. This issue was not listed in the on-call log, presumably because it did not lead to an outage. However, perceived data loss is equally (or more?) critical, so the following actions are proposed:
- Create the (perceived) data loss label and place it highest in label priority ordering.
- this label needs to be documented and widely known
- clarify in handbook that issues with this label are to be written into the on-call log.
- clarify (how? where? facilitate how?) that handovers need to be explicit; although presumably that is already the case for issues in the on-call log.

Edited May 23, 2017 by Ernst van Nierop