Post-mortem on missing NFS mounts

Original issue and timeline

What went well

  • The root-cause hypothesis (an incorrect timeo setting) was identified X?? hours after the first reports, and it turned out to be correct.
  • An issue was filed, other team members picked it up from the queue, and
    • a preventative fix against recurrence was rolled out relatively quickly
    • work to reboot all hosts was started
  • The unscheduled work was fully resolved within ~24 hours of the issue being filed.

What can be improved

  • How did the timeo variable get reverted to 1 second? How could that reversion have been caught or prevented?
  • The hand-off of tasks from one team member to the next, as we cycled through timezones and individuals' availability, should be improved. This issue was not listed in the on-call log, presumably because it did not lead to an outage. However, perceived data loss is equally (or more?) critical, so the following actions are proposed:
    • Create the (perceived) data loss label and place it highest in label priority ordering.
    • Document this label and make it widely known.
    • Clarify in the handbook that issues with this label are to be written into the on-call log.
    • Clarify (how? where? facilitate how?) that handovers need to be explicit, although presumably that is already the case for issues in the on-call log.
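
On the question of how the timeo reversion could have been caught: one possible approach is a periodic check that reads the mount table and flags NFS mounts whose timeo is below an agreed minimum. The sketch below is only an illustration, not what was deployed. The function name and the threshold are hypothetical, and it assumes a Linux host where /proc/mounts lists the effective mount options. Note that NFS expresses timeo in tenths of a second, so a 1-second timeout appears as timeo=10.

```python
from typing import Iterable, List, Tuple

# Filesystem types treated as NFS client mounts in /proc/mounts.
NFS_TYPES = ("nfs", "nfs4")


def find_low_timeo_mounts(
    mounts_lines: Iterable[str], min_timeo_ds: int
) -> List[Tuple[str, int]]:
    """Return (mount_point, timeo) for NFS mounts with timeo below the minimum.

    min_timeo_ds is in deciseconds, the unit NFS uses for timeo.
    The right minimum is site-specific; it is an assumption of this sketch.
    """
    flagged = []
    for line in mounts_lines:
        fields = line.split()
        if len(fields) < 4:
            continue  # not a well-formed mount table line
        mount_point, fs_type, options = fields[1], fields[2], fields[3]
        if fs_type not in NFS_TYPES:
            continue
        for option in options.split(","):
            key, _, value = option.partition("=")
            if key == "timeo" and value.isdigit():
                if int(value) < min_timeo_ds:
                    flagged.append((mount_point, int(value)))
    return flagged


if __name__ == "__main__":
    # Hypothetical threshold: flag anything below 60 seconds.
    with open("/proc/mounts") as mounts:
        for mount_point, timeo in find_low_timeo_mounts(mounts, 600):
            print(f"{mount_point}: timeo={timeo} (tenths of a second)")
```

Run from cron or a monitoring agent, a check like this would turn a silent configuration reversion into an alert. It does not explain how the reversion happened, so it complements a review of the configuration management history rather than replacing it.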
Edited by Ernst van Nierop