# NFS-02 outage

## Context

At 00:50 UTC on 2017-07-20, nfs-02 stopped sending metrics and went offline. The last metrics received showed a load average in the thousands and 100% CPU usage.
## Timeline

All times are on 2017-07-20:
- 00:50 UTC - pagerduty reports nfs-02 is down (start of the slack discussion: https://gitlab.slack.com/archives/C101F3796/p1500511840864796)
- 00:52 UTC - @briann and I confirm we can't access the server via ssh, although tcp handshake finishes fine. Azure console reports server is up w/o errors, but cpu usage is 100%.
- 00:54 UTC - We tweet the status.
- 01:03 UTC - We discover that an ssh connection gets as far as pubkey authentication, but still times out.
- 01:05 UTC - We're still seeing logs being sent from nfs-02 to logstash, including OOM-killer invocations against the gitaly process.
- 01:11 UTC - We decide not to perform a hard reboot, to avoid potential data corruption, and instead wait for the server to stop thrashing. Disabling gitaly features has no noticeable effect. At the same time, I (ilya) decide not to open an Azure ticket because I'm not seeing any Azure-caused effect this time.
- 01:20 UTC - We start investigating whether this was related to a large number of tags being pushed to gitlab-ee.
- 01:31 UTC - We haven't seen any logs from nfs-02 for 25 minutes.
- 01:41 UTC - We attempt a restart of the box from the Azure console. The other option would be a stop/start, which would take 10-15 minutes and might cause data corruption.
- 01:55 UTC - After several restart requests (all of which were reported as successful), a lone ACPI signal probably got through. The server reboots cleanly and we can access it over ssh.
- 01:58 UTC - Stan expired caches for nfs-02; we confirm metrics are back up in Prometheus.
- 02:05 UTC - We start analyzing logs. First finding: 252 OOM-killer invocations against gitaly and git processes between 00:48 and 01:02 UTC.
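For reference, a count like the one above can be pulled out of a kernel log with standard tools. A minimal sketch, using hypothetical sample lines in the standard Linux oom-killer format (not the actual nfs-02 log):

```shell
# Write a few hypothetical kernel-log lines (format matches the Linux oom-killer message)
cat > /tmp/kern.sample <<'EOF'
Jul 20 00:48:12 nfs-02 kernel: gitaly invoked oom-killer: gfp_mask=0x24201ca, order=0
Jul 20 00:49:03 nfs-02 kernel: git invoked oom-killer: gfp_mask=0x24201ca, order=0
Jul 20 00:55:40 nfs-02 kernel: gitaly invoked oom-killer: gfp_mask=0x24201ca, order=0
EOF

# Count oom-killer invocations per process name
grep -oE '[a-z-]+ invoked oom-killer' /tmp/kern.sample | sort | uniq -c | sort -rn
```

On a real host the input would be `/var/log/kern.log` or the output of `journalctl -k` rather than a sample file.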
## Root Cause Analysis

Why did this outage occur?

- Why? `nfs-file-02` stopped responding for about 50 minutes, from 00:50 until 01:40 UTC.
- Why? The server was unable to handle new incoming requests because it had run out of memory and CPU.
- Why? About 800 HTTPS `post-upload-pack` requests came in within a 3-minute period. Normally the server would handle these relatively easily, but in this case the requests took the server down.
- Why? Each `post-upload-pack` request needed to do more compression than normal, since the `gitlab-ee` repository had recently had a large number of tags deleted and had not been `git gc`'ed. If the GC had occurred, each `post-upload-pack` process would not have consumed as many resources as it did.
- Why? Sidekiq was down, so the usual `git gc` jobs had not run.
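The missing-GC condition in the chain above is visible through git's own plumbing. A minimal sketch against a throwaway repository (assumes a plain `git` CLI; this is not GitLab's actual housekeeping job):

```shell
set -e
# Build a throwaway repo with one commit (three loose objects: blob, tree, commit)
repo=$(mktemp -d)
git init -q "$repo"
cd "$repo"
echo hello > file.txt
git add file.txt
git -c user.name=demo -c user.email=demo@example.com commit -qm 'init'

# Loose objects accumulate until gc packs them; a large count means gc is overdue
before=$(git count-objects -v | awk '/^count:/ {print $2}')
git gc -q
after=$(git count-objects -v | awk '/^count:/ {print $2}')
echo "loose objects before gc: $before, after gc: $after"
```

On a busy repository with many deleted tags, `git count-objects -v` reporting a large loose-object count is the signal that the scheduled gc has not been running.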
## What went well

- We didn't power-cycle the host -- we avoided potential data corruption, and we learned that an ACPI signal might still reach the kernel long after Azure reports the action as finished.
## What can be improved

- We can stop gitaly processes from consuming all the RAM by limiting them with cgroups, or by containerizing the app in the long run.
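As a sketch of the cgroup approach: if gitaly ran as a systemd service, a drop-in like the following would cap the service's cgroup so the kernel OOM-kills processes inside it instead of exhausting the whole host. The path, unit name, and limit values are illustrative assumptions, not our actual configuration:

```ini
# /etc/systemd/system/gitaly.service.d/limits.conf  (hypothetical path and values)
[Service]
# Hard memory cap for the service's cgroup; beyond this, the kernel
# OOM-kills processes inside the cgroup rather than across the host.
MemoryMax=8G
# Optional CPU ceiling (4 cores' worth) to keep the box responsive under load.
CPUQuota=400%
```

On older systemd with cgroups v1 (common in 2017) the memory directive is `MemoryLimit` rather than `MemoryMax`.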
## Corrective actions
cc @gl-infra