Post-mortem on missing NFS mounts
Original issue and timeline
- 05-18 19:24 UTC Per https://gitlab.com/gitlab-com/infrastructure/issues/1828, @brian noticed that some workers were (again) not properly mounting the LFS drives, leading to misplacing LFS objects and thus perceived data loss. (Perceived, since the data was not lost, and readily recovered). For the purposes of this particular post-mortem, I'm going to pick the original reverting of the `timeo` as the starting point.
  - Brian recommended resetting the `timeo` value for NFS mounting on the workers
- 05-19 03:39 UTC @stanhu realizes we did not cycle all the workers
- 05-19 03:52 UTC @jtevnan adjusted the cookbooks (gitlab-cookbooks/gitlab-nfs-cluster!32 (diffs), https://gitlab.com/gitlab-com/infrastructure/issues/1831), which should prevent this from recurring. This was written by Jason and merged by John, all within about 2 hours.
- 05-19 07:14 UTC @northrup reported git workers had been rebooted, sidekiq was next.
- 05-19 14:35 UTC Following some conversation on chat, and a report by @dblessing that "something strange is still happening", Stan asks who is continuing the work; he and @omame have a call and Daniele continues the reboot work (link to record of which hosts?)
- 05-19 15:32 UTC Brian started checking which hosts actually have the LFS share mounted https://gitlab.slack.com/archives/C101F3796/p1495207948348488
- 05-19 16:32 UTC Stan asks Ernst to see what is holding up progress. Ernst reads the issue and pings available team members; @ahanselka volunteers to determine status and continue the work.
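
The 15:32 step, checking which hosts actually have the LFS share mounted, can be scripted per host. The following is a minimal sketch, not the check Brian ran: it assumes a Linux worker where the contents of `/proc/mounts` are available as text, and the mount point `/mnt/lfs` and server name `nfs01` in the sample are hypothetical.

```python
from typing import Dict


def parse_mounts(mounts_text: str) -> Dict[str, dict]:
    """Map each mount point to its device, filesystem type and options.

    Expects text in /proc/mounts format:
    device mount_point fs_type options dump pass
    """
    mounts = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        device, mount_point, fs_type, raw_options = fields[:4]
        options = {}
        for opt in raw_options.split(","):
            key, _, value = opt.partition("=")
            options[key] = value
        # A later entry for the same mount point replaces the earlier one.
        mounts[mount_point] = {
            "device": device,
            "fs_type": fs_type,
            "options": options,
        }
    return mounts


def is_nfs_mounted(mounts_text: str, mount_point: str) -> bool:
    """Return True only if mount_point is present and backed by NFS."""
    entry = parse_mounts(mounts_text).get(mount_point)
    return entry is not None and entry["fs_type"] in ("nfs", "nfs4")


# Hypothetical sample; on a real worker, read the text from /proc/mounts.
SAMPLE_MOUNTS = (
    "nfs01:/lfs /mnt/lfs nfs4 rw,relatime,timeo=50,retrans=2 0 0\n"
    "/dev/sda1 / ext4 rw,relatime 0 0\n"
)

if __name__ == "__main__":
    print(is_nfs_mounted(SAMPLE_MOUNTS, "/mnt/lfs"))
```

Checking the filesystem type, and not only that the directory exists, matters here: a worker with a missing mount still has the directory and writes to local disk, which is how objects ended up misplaced.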
What went well
- Root cause hypothesis of incorrect `timeo` setting was identified X?? hours after first reports, and hypothesis turned out to be correct.
- Issue was made, other team members picked up on the queue, and
  - a preventative fix for future re-occurrence was rolled out relatively quickly
  - work to reboot all hosts was started
- Unscheduled work was fully resolved within ~24 hours of the issue being filed.
What can be improved
- How did the `timeo` variable get reverted back to 1 second? How could that reversion have been caught / prevented?
- The hand-off of tasks from one team member to the next as we cycled through timezones and individuals' availabilities should be improved. This issue was not listed in the on-call log, presumably because it did not lead to an outage. However, perceived data loss is equally (or more?) critical, so the following actions are proposed:
  - Create the (perceived) data loss label and place it highest in label priority ordering.
    - this label needs to be documented and widely known
    - clarify in handbook that issues with this label are to be written into the on-call log.
  - clarify (how? where? facilitate how?) that handovers need to be explicit; although presumably that is already the case for issues in the on-call log.
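
One possible way to catch a `timeo` reversion is a periodic check that compares the option on each live NFS mount against the value the cookbook is supposed to set. This is a sketch under assumptions, not an existing GitLab check: the function name, the sample hosts and paths, and the expected value of 50 are all hypothetical. Note that the `timeo` mount option is expressed in tenths of a second, so a 1 second timeout appears as `timeo=10`.

```python
from typing import List, Optional, Tuple


def nfs_timeo_mismatches(
    mounts_text: str, expected_timeo: int
) -> List[Tuple[str, Optional[str]]]:
    """List NFS mounts whose timeo option differs from expected_timeo.

    mounts_text is text in /proc/mounts format. Each result is a
    (mount_point, actual_timeo) pair; actual_timeo is None when the
    option is absent from the mount's option list.
    """
    mismatches = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4 or fields[2] not in ("nfs", "nfs4"):
            continue  # ignore malformed lines and non-NFS filesystems
        mount_point, raw_options = fields[1], fields[3]
        actual = None
        for opt in raw_options.split(","):
            key, _, value = opt.partition("=")
            if key == "timeo":
                actual = value
        if actual != str(expected_timeo):
            mismatches.append((mount_point, actual))
    return mismatches


# Hypothetical sample: one mount reverted to timeo=10 (1 second).
SAMPLE_MOUNTS = (
    "nfs01:/lfs /mnt/lfs nfs4 rw,timeo=10,retrans=2 0 0\n"
    "nfs01:/git /mnt/git nfs4 rw,timeo=50,retrans=2 0 0\n"
    "/dev/sda1 / ext4 rw 0 0\n"
)

if __name__ == "__main__":
    for mount_point, actual in nfs_timeo_mismatches(SAMPLE_MOUNTS, 50):
        print(f"{mount_point}: timeo={actual}")
```

Reading the options from the live mount table, instead of from the configuration file, also covers the case seen in this incident where the configuration was corrected but workers had not yet been cycled.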
Edited by Ernst van Nierop