Unicorn timeouts on web fleet causing site to be unresponsive and unavailable for a short duration 2017-09-26
This issue tracks the work done in response to the site problems on 2017-09-26, when the site became slow and unresponsive. We noticed immediately that Unicorn workers were timing out:
E, [2017-09-26T08:39:28.064243 #31410] ERROR -- : worker=11 PID:37408 timeout (61s > 60s), killing
E, [2017-09-26T08:39:28.097965 #31410] ERROR -- : reaped #<Process::Status: pid 37408 SIGKILL (signal 9)> worker=11
I, [2017-09-26T08:39:28.354893 #40057] INFO -- : worker=11 ready
E, [2017-09-26T08:39:41.122040 #31410] ERROR -- : worker=28 PID:36183 timeout (61s > 60s), killing
E, [2017-09-26T08:39:41.167844 #31410] ERROR -- : reaped #<Process::Status: pid 36183 SIGKILL (signal 9)> worker=28
I, [2017-09-26T08:39:41.429438 #40405] INFO -- : worker=28 ready
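To gauge how often workers were being killed, log lines like the ones above can be tallied per minute. The following is a minimal sketch, not the tooling used during the incident: the regular expression is inferred only from the two timeout lines shown here, and the names `TIMEOUT_RE`, `count_timeouts_per_minute`, and `SAMPLE` are hypothetical.

```python
import re
from collections import Counter

# Pattern inferred from the timeout lines in this issue (an assumption;
# other Unicorn versions or logger formats may differ), e.g.:
# E, [2017-09-26T08:39:28.064243 #31410] ERROR -- : worker=11 PID:37408 timeout (61s > 60s), killing
TIMEOUT_RE = re.compile(
    r"\[(?P<minute>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}):\d{2}\.\d+ #\d+\]\s+ERROR -- : "
    r"worker=(?P<worker>\d+) PID:(?P<pid>\d+) "
    r"timeout \((?P<elapsed>\d+)s > (?P<limit>\d+)s\), killing"
)


def count_timeouts_per_minute(lines):
    """Return a Counter mapping 'YYYY-MM-DDTHH:MM' to the number of worker timeouts."""
    counts = Counter()
    for line in lines:
        match = TIMEOUT_RE.search(line)
        if match:
            counts[match.group("minute")] += 1
    return counts


# The log excerpt from this issue, used as sample input.
SAMPLE = [
    "E, [2017-09-26T08:39:28.064243 #31410] ERROR -- : worker=11 PID:37408 timeout (61s > 60s), killing",
    "E, [2017-09-26T08:39:28.097965 #31410] ERROR -- : reaped #<Process::Status: pid 37408 SIGKILL (signal 9)> worker=11",
    "I, [2017-09-26T08:39:28.354893 #40057] INFO -- : worker=11 ready",
    "E, [2017-09-26T08:39:41.122040 #31410] ERROR -- : worker=28 PID:36183 timeout (61s > 60s), killing",
    "E, [2017-09-26T08:39:41.167844 #31410] ERROR -- : reaped #<Process::Status: pid 36183 SIGKILL (signal 9)> worker=28",
    "I, [2017-09-26T08:39:41.429438 #40405] INFO -- : worker=28 ready",
]

if __name__ == "__main__":
    for minute, count in sorted(count_timeouts_per_minute(SAMPLE).items()):
        print(minute, count)
```

Only the `timeout ... killing` lines are counted; the `reaped` and `ready` lines that follow each kill are skipped so that a single worker restart is not counted more than once. To run this against a real log, pass an open file object instead of `SAMPLE`.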
- some notes from the call: https://docs.google.com/document/d/194UGMh8Q7aneJ3Ec6gxGu50Sfr-nKJDTDskT1r9D7jM/edit
- strace of worker right before timeout: https://drive.google.com/open?id=0Bx7TBS6nz20OeDBpYlB3Q1BQRzA
Timeline:
- 08:11 Pingdom reports GitLab down
- 08:17 Pingdom reports GitLab up
- 08:30 Noticed timeouts in the unicorn log
- 08:43 Turned off gitaly_branch_names
- Later in the day we decided to bring up additional web nodes, increasing them from 7 to 11.
- https://gitlab.com/gitlab-com/gitlab-com-infrastructure/merge_requests/160
- https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/1168
- We believe that the additional stress on the web nodes may be due to a lack of caching in Redis, as we had recently brought up a cold cache as an outcome of the outage the day before: https://gitlab.com/gitlab-com/infrastructure/issues/2855 .
- A regression is putting additional load on the API fleet, see https://gitlab.com/gitlab-org/gitlab-ce/issues/38438 . We have only been seeing this additional pressure on the API fleet since around the 15th: https://performance.gitlab.net/dashboard/db/fleet-overview?refresh=5m&panelId=40&fullscreen&orgId=1&from=now%2FM&to=now%2FM