Git09 oopsed but there was no alert
UTC times:
- 03:40 I noticed that I couldn't ssh into git09; @briann noticed that git09 had an oops: https://log.gitlap.com/goto/bd5a84d891d5785300e8b2f5036e737d
- 03:50 I attempted to restart it from Azure
- 03:55 That reported success but didn't work, so I stopped the node
- 04:05 Started the node
The main issue is that there were no alerts during any of this, neither when sshd was locked nor when the node was shutting down. I was concerned about losing 1/12 of our git traffic, and since there's no way to access the box from the console, I didn't do much diagnosis. I just verified that ssh was not accessible (both from outside and from git08) and proceeded with the restart.
Can we improve our monitoring? Should we do so? Will it result in flaky hosts? cc @gl-infra
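
As a possible starting point for the monitoring question, here is a minimal sketch of an external SSH probe. It goes one step further than a plain port check: it waits for the server's `SSH-` identification line, since a wedged sshd may still accept the TCP connection without ever answering. The function name `ssh_banner_ok` and the timeout values are illustrative assumptions, not part of any existing tooling we run.

```python
import socket


def ssh_banner_ok(host: str, port: int = 22, timeout: float = 5.0) -> bool:
    """Return True if the host sends an SSH identification line.

    Hypothetical helper for illustration. `timeout` applies to the connect
    and to each individual read, so it is not a strict overall deadline.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.settimeout(timeout)
            data = b""
            # A server may send other lines before its version string,
            # so scan every received line, up to a small size cap.
            while len(data) < 4096:
                chunk = sock.recv(256)
                if not chunk:
                    break  # peer closed the connection without a banner
                data += chunk
                if any(line.startswith(b"SSH-") for line in data.splitlines()):
                    return True
            return False
    except OSError:
        # Covers refused connections, unreachable hosts, and timeouts.
        return False


if __name__ == "__main__":
    import sys

    target = sys.argv[1] if len(sys.argv) > 1 else "localhost"
    # Exit code 0 when the banner is seen, 1 otherwise.
    sys.exit(0 if ssh_banner_ok(target) else 1)
```

Run from a neighbouring node (for example git08 probing git09), a check like this could feed an alert. Requiring several consecutive failures before paging would probably help keep it from being noisy.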