NFS-02 outage

Context

At 00:50 UTC on Jul 20, nfs-02 stopped sending metrics and went offline. The last received data showed load average in the thousands and 100% CPU usage.

Timeline

On date: 2017-07-20

  • 00:50 UTC - pagerduty reports nfs-02 is down (start of the slack discussion: https://gitlab.slack.com/archives/C101F3796/p1500511840864796)
  • 00:52 UTC - @briann and I confirm we can't access the server via ssh, although tcp handshake finishes fine. Azure console reports server is up w/o errors, but cpu usage is 100%.
  • 00:54 UTC - We tweet status.
  • 01:03 UTC - We discover that the ssh connection gets as far as pubkey authentication, but still times out.
  • 01:05 UTC - We're still seeing logs being sent from nfs-02 to logstash, including OOM killer invocations against the gitaly process.
  • 01:11 UTC - We decide not to perform a hard reboot, to avoid potential data corruption, and instead wait for the box to stop thrashing. Disabling gitaly features has no noticeable effect. At the same time, I (ilya) decide not to open a ticket because I'm not seeing any Azure-caused effect this time.
  • 01:20 UTC - We start investigating whether this is related to pushing a large number of tags in gitlab-ee.
  • 01:31 UTC - We haven't seen any logs from nfs-02 for 25 minutes.
  • 01:41 UTC - We attempt a restart of the box from the Azure console. The other option would be a stop/start, which would take 10-15 minutes and might cause data corruption.
  • 01:55 UTC - After several restart requests (all of which were reported as successful), a lone ACPI signal apparently got through. The server reboots cleanly and we can access it over ssh.
  • 01:58 UTC - Stan expires caches for nfs-02; we confirm metrics are back up in prometheus.
  • 02:05 UTC - We start analyzing logs. First findings: 252 OOM invocations for gitaly and git processes between 00:48 and 01:02 UTC.

Root Cause Analysis

Why did this outage occur?

  • Why? nfs-02 stopped responding for about 50 minutes, from 00:50 until 01:40 UTC.

  • Why? The server was unable to handle new incoming requests because it was out of memory and CPU.

  • Why? About 800 https post-upload-pack requests came in within a 3 minute period. Normally the server would be able to handle these relatively easily, but in this case, the requests took the server down.

  • Why? Each post-upload-pack request needed to do more compression than normal, since the gitlab-ee repository had recently had a large number of tags deleted and had not been git gc'ed. Had the GC occurred, each post-upload-pack process would not have consumed as many resources as it did.

  • Why? Sidekiq was down, and therefore the usual git gc processes had not run.
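The missing housekeeping step in the chain above can be run by hand. The sketch below is a local illustration (the throwaway repository path is invented for the example); in production the equivalent would be a gc of the affected gitlab-ee repository on nfs-02:

```shell
# Minimal sketch: simulate a repo with deleted tags, then repack.
# In production the same "git gc" would run against the gitlab-ee
# bare repository; everything else here is illustrative scaffolding.
tmp=$(mktemp -d)
git init -q "$tmp"
git -C "$tmp" -c user.email=a@b.c -c user.name=a \
    commit -q --allow-empty -m init
git -C "$tmp" tag v1
git -C "$tmp" tag -d v1            # deleted tag leaves unreachable refs/objects behind
git -C "$tmp" gc --quiet --prune=now   # repack and drop unreachable objects
```

After a gc like this, upload-pack serves clients from existing packfiles instead of recompressing objects per request.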

What went well

  • We didn't power-cycle the host -- we avoided the potential data corruption, and learned that ACPI signals might still pass through to the kernel long after Azure reports that the action has finished.

What can be improved

  • We can prevent the gitaly processes from consuming all of the RAM by using cgroups, or by containerizing the app in the long run.
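As a sketch of the cgroups idea, a systemd drop-in could cap gitaly's memory. This assumes gitaly runs as a systemd service named gitaly.service; the unit name, file path, and 8G limit are all illustrative, not our actual configuration:

```ini
# Hypothetical drop-in: /etc/systemd/system/gitaly.service.d/memory-limit.conf
# Caps the service's cgroup so OOM kills are contained to gitaly
# rather than starving the whole host. Limit value is illustrative.
[Service]
MemoryLimit=8G
```

With a cap like this, runaway git/gitaly processes would be OOM-killed inside their own cgroup instead of driving the entire box into thrashing.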

Corrective actions

cc @gl-infra

Edited by Ilya Frolov