# NFS-02 outage
## Context

At 00:50 UTC on 2017-07-20, nfs-02 stopped sending metrics and went offline. The last data received showed load in the thousands and 100% CPU usage.

## Timeline

All times 2017-07-20 UTC.

- 00:50 - PagerDuty reports nfs-02 is down (start of the Slack discussion: https://gitlab.slack.com/archives/C101F3796/p1500511840864796).
- 00:52 - @briann and I confirm we can't access the server via SSH, although the TCP handshake completes fine. The Azure console reports the server is up without errors, but CPU usage is 100%.
- 00:54 - We tweet status.
- 01:03 - We discover that an SSH connection gets as far as pubkey authentication, but still times out.
- 01:05 - We're still seeing logs being sent from nfs-02 to logstash, with OOM invocations against the gitaly process.
- 01:11 - We decide not to perform a hard reboot, to avoid potential data corruption, but rather to wait for the server to stop thrashing. Disabling gitaly features has no noticeable effect. At the same time, I (ilya) decide not to open a ticket because I'm not seeing any Azure-caused effect this time.
- 01:20 - We start investigating whether this was related to pushing a large number of tags in gitlab-ee.
- 01:31 - We haven't seen any logs from nfs-02 for 25 minutes.
- 01:41 - We attempt a restart of the box from the Azure console. The other option would be a stop/start, which would take 10-15 minutes and might cause data corruption.
- 01:55 - After several restart requests (all of which were reported as successful), a lone ACPI signal apparently got through. The server reboots cleanly and we can access it over SSH.
- 01:58 - Stan expires caches for nfs-02; we confirm metrics are up in Prometheus.
- 02:05 - We start to analyze logs. First findings: 252 OOM invocations for gitaly and git processes from 00:48 to 01:02 UTC.

## Root Cause Analysis

### Why did this outage occur?

- Why? `nfs-file-02` stopped responding for about 50 minutes, from 00:50 until 01:40 UTC.
- Why? The server was unable to handle new incoming requests because it had run out of memory and CPU.
- Why?
About 800 HTTPS `post-upload-pack` requests came in within a 3-minute period. Normally the server would handle these relatively easily, but in this case the requests took the server down.
- Why? Each `post-upload-pack` request needed to do more compression than normal, since the `gitlab-ee` repository had recently had a large number of tags deleted and had not been `git gc`'ed. If the GC had occurred, the `post-upload-pack` processes would not have each consumed as many resources as they did.
- Why? Sidekiq was down, and therefore the usual `git gc` runs had not taken place.

## What went well

- We didn't power-cycle the host: we avoided potential data corruption, and we learned that an ACPI signal might still reach the kernel long after Azure reports the action as finished.

## What can be improved

- We can stop the gitaly processes from consuming all the RAM by using cgroups, or by containerizing the app in the long run.

## Corrective actions

- https://gitlab.com/gitlab-com/infrastructure/issues/2364

cc @gl-infra
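As a sketch of the cgroup approach mentioned under "What can be improved": if gitaly runs as a systemd service, a drop-in can cap its memory via the cgroup memory controller, so the OOM killer targets that cgroup instead of starving the whole host. The unit name and limit below are illustrative assumptions, not our actual configuration:

```
# /etc/systemd/system/gitaly.service.d/memory-limit.conf  (hypothetical drop-in)
[Service]
# Hard memory ceiling for the service's cgroup.
# MemoryMax= applies on cgroup v2 (systemd >= 231); MemoryLimit= is the
# older cgroup v1 directive. Setting both covers either hierarchy.
MemoryMax=8G
MemoryLimit=8G
```

Applying it would require a `systemctl daemon-reload` followed by a restart of the service; the right limit value would need to be derived from the host's normal gitaly memory profile.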
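The OOM count cited in the last timeline entry (252 invocations between 00:48 and 01:02) can be reproduced from the kernel log with a simple grep. A minimal sketch, assuming the standard `invoked oom-killer` kernel message; the sample file below stands in for the real log (typically `/var/log/kern.log` or `journalctl -k` output), whose exact path on nfs-02 is an assumption:

```shell
# Build a small sample of kernel-log lines; on the server you would grep
# the real log instead of this file.
cat > /tmp/kern_sample.log <<'EOF'
Jul 20 00:48:01 nfs-02 kernel: git invoked oom-killer: gfp_mask=0x24201ca, order=0
Jul 20 00:49:12 nfs-02 kernel: gitaly invoked oom-killer: gfp_mask=0x24201ca, order=0
Jul 20 00:50:30 nfs-02 kernel: Out of memory: Kill process 1234 (git) score 512
EOF

# -c prints the number of matching lines; only the "invoked oom-killer"
# lines count as invocations, not the subsequent kill reports.
grep -c 'invoked oom-killer' /tmp/kern_sample.log   # prints 2
```

Narrowing to the incident window would just mean adding a timestamp filter (e.g. piping through `awk` on the time field) before counting.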