GitLab outage 2017-09-04
Timeline of events is the following (all times are UTC):
[14:39] Pagerduty notification for GitLab.com site reporting down
[14:50] @ahmadsherif identified the root cause as gitaly
processes leaving <defunct>
processes
[15:04] @jacobvosmaer-gitlab identifies the problem in the source code - affecting LastCommitForPath
and FindCommit
features
[15:11] gitaly
rolling restart across all storage nodes - @eReGeBe issues bundle exec knife ssh -C 5 'roles:gitlab-base-stor-nfs' 'sudo gitlab-ctl restart gitaly'
[15:15] @victorcete restarting the rest of unresponsive NFS servers (nfs-file-07
to nfs-file-12
) directly from Azure panel
[15:27] all servers but nfs-file-08
are back up
[15:29] @victorcete restarts nfs-file-08
again from Azure
[15:33] GitLab.com is back
/cc @gl-infra