NFS-02 outage

Context

At 00:50 UTC on Jul 20, nfs-02 stopped sending metrics and went offline. The last received data showed load average in the thousands and 100% CPU usage.

Timeline

On date: 2017-07-20

  • 00:50 UTC - pagerduty reports nfs-02 is down (start of the slack discussion: https://gitlab.slack.com/archives/C101F3796/p1500511840864796)
  • 00:52 UTC - @briann and I confirm we can't access the server via ssh, although tcp handshake finishes fine. Azure console reports server is up w/o errors, but cpu usage is 100%.
  • 00:54 UTC - We tweet status.
  • 01:03 UTC - We discover that the ssh connection gets as far as pubkey authentication, but still times out.
  • 01:05 UTC - We're still seeing logs being sent from nfs-02 to logstash, including OOM killer invocations against the gitaly process.
  • 01:11 UTC - We decide not to perform a hard reboot, to avoid potential data corruption, and instead wait for the box to stop thrashing. Disabling gitaly features has no noticeable effect. At the same time, I (ilya) decide not to open a ticket because I'm not seeing any Azure-caused effect this time.
  • 01:20 UTC - We start investigating whether this is related to pushing a large number of tags in gitlab-ee.
  • 01:31 UTC - We haven't seen any logs from nfs-02 for 25 minutes.
  • 01:41 UTC - We attempt a restart of the box from the Azure console. The other option would be a stop/start, which would take 10-15 minutes and might cause data corruption.
  • 01:55 UTC - After several restart requests (all of which were reported as successful), a lone ACPI signal apparently got through. The server reboots cleanly and we can access it over ssh.
  • 01:58 UTC - Stan expires caches for nfs-02; we confirm metrics are back up in prometheus.
  • 02:05 UTC - We start analyzing logs. First findings: 252 OOM invocations for gitaly and git processes between 00:48 and 01:02 UTC.

Root Cause Analysis

Why did this outage occur?

  • Why? nfs-02 stopped responding for about 50 minutes, from 00:50 until 01:40 UTC.

  • Why? The server was unable to handle new incoming requests because it was out of memory and CPU.

  • Why? About 800 https post-upload-pack requests came in within a 3 minute period. Normally the server would be able to handle these relatively easily, but in this case, the requests took the server down.

  • Why? Each post-upload-pack request needed to do more compression than normal, since the gitlab-ee repository had recently had a large number of tags deleted and had not been git gc'ed. Had the GC occurred, each post-upload-pack process would not have consumed as many resources as it did.

  • Why? Sidekiq was down, and therefore the usual git gc processes had not run.
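The missing housekeeping step in the chain above can be run by hand. The sketch below is a local illustration (the throwaway repository path is invented for the example); in production the equivalent would be a gc of the affected gitlab-ee repository on nfs-02:

```shell
# Minimal sketch: simulate a repo with deleted tags, then repack.
# In production the same "git gc" would run against the gitlab-ee
# bare repository; everything else here is illustrative scaffolding.
tmp=$(mktemp -d)
git init -q "$tmp"
git -C "$tmp" -c user.email=a@b.c -c user.name=a \
    commit -q --allow-empty -m init
git -C "$tmp" tag v1
git -C "$tmp" tag -d v1            # deleted tag leaves unreachable refs/objects behind
git -C "$tmp" gc --quiet --prune=now   # repack and drop unreachable objects
```

After a gc like this, upload-pack serves clients from existing packfiles instead of recompressing objects per request.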

What went well

  • We didn't power-cycle the host -- we avoided the potential data corruption, and learned that ACPI signals might still pass through to the kernel long after Azure reports that the action has finished.

What can be improved

  • We can prevent the gitaly processes from consuming all of the RAM by using cgroups, or by containerizing the app in the long run.
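As a sketch of the cgroups idea, a systemd drop-in could cap gitaly's memory. This assumes gitaly runs as a systemd service named gitaly.service; the unit name, file path, and 8G limit are all illustrative, not our actual configuration:

```ini
# Hypothetical drop-in: /etc/systemd/system/gitaly.service.d/memory-limit.conf
# Caps the service's cgroup so OOM kills are contained to gitaly
# rather than starving the whole host. Limit value is illustrative.
[Service]
MemoryLimit=8G
```

With a cap like this, runaway git/gitaly processes would be OOM-killed inside their own cgroup instead of driving the entire box into thrashing.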

Corrective actions

cc @gl-infra

Edited by Ilya Frolov