Web server becomes slow and eventually stops responding
Summary
We hit this problem nearly every day: our GitLab instance becomes slow and then stops serving requests entirely (no error page from workhorse; all requests simply time out). For a while, this was happening at a nearly consistent time every day (around 3:15pm, plus or minus 10 minutes).
Steps to reproduce
Unfortunately, we are not able to reproduce this. Lack of a reproducer is what has kept us from filing this for so long.
Things we suspect may cause this:
- Creating commits via the API. The pre-receive and post-receive hooks often seem to belong to repositories that are manipulated via the API (see the sketch after this list).
- Deleting branches via the web UI. Sometimes, the pre-receive and post-receive hooks belong to a repository where one of our users regularly cleans up branches via the UI.
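For concreteness, this is the kind of API-driven commit we have in mind: a minimal sketch against the v3 Files API (the project ID, file path, branch, and token are all placeholders, not our actual values).

```
# Minimal sketch: create a single-file commit via the v3 Files API.
# Project ID (42), file path, branch, and token are placeholders.
curl --request POST --header "PRIVATE-TOKEN: <your_token>" \
  --data "file_path=notes/example.txt" \
  --data "branch_name=master" \
  --data "content=example" \
  --data "commit_message=Add example note" \
  "https://gitlab.example.com/api/v3/projects/42/repository/files"
```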
Expected behavior
GitLab should remain responsive. If requests are impacting server performance, we should be able to find evidence of them in logs so that we can troubleshoot better.
Actual behavior
When this happens, we see a large number of gitlab-shell processes stacked up. Often, but not always, all unicorn workers will have pre-receive or post-receive subprocesses. These are for repos with no webhooks configured.
Sometimes, after a few minutes, things return to normal. More often, we need to restart GitLab. Killing the pre-receive and post-receive subprocesses often makes things responsive again, but the server sometimes returns to the same unresponsive state a few minutes later. Increasing the number of unicorn workers (from 4 to 8) seems to make this happen less frequently, but does not eliminate the problem. With 8 workers, we've seen the server become unresponsive with only 4 of the unicorns having pre-receive or post-receive subprocesses.
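A sketch of the kind of check that surfaces this state, assuming the workers are titled `unicorn_rails worker[N]` as in a source install (adjust the pattern for other setups):

```
# Sketch: list pre-/post-receive children under each unicorn worker.
# Assumes workers are titled "unicorn_rails worker[N]" (source install).
for w in $(pgrep -f 'unicorn_rails worker'); do
  kids=$(pgrep -l -P "$w" -f '(pre|post)-receive')
  [ -n "$kids" ] && printf 'worker %s:\n%s\n' "$w" "$kids"
done
```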
While this is happening, load, CPU usage, and memory usage are all normal. There are no abnormal messages in any of GitLab's logs. Postgres logs do not indicate any long-running queries that correlate with this problem happening. There do not appear to be sidekiq jobs stacked up, and the sidekiq process never goes past 2 or 3 of 25 busy before returning to 0 busy.
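For reference, the sort of long-running-query check we mean; a sketch assuming PostgreSQL 9.2 or later (where `pg_stat_activity` exposes `state` and `query`) and the conventional `gitlabhq_production` database name:

```
# Sketch: surface queries running longer than one minute.
# Assumes PostgreSQL >= 9.2 and the default database name.
psql -d gitlabhq_production -c "
  SELECT pid, now() - query_start AS runtime, state, query
  FROM pg_stat_activity
  WHERE state <> 'idle'
    AND now() - query_start > interval '1 minute'
  ORDER BY runtime DESC;"
```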
Relevant logs and/or screenshots
Unfortunately, we have not found anything relevant in GitLab's logs.
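For reference, the kind of sweep that came up empty; a sketch with log paths assuming a source install under /home/user/git:

```
# Sketch: sweep the main logs for anything unusual around an outage.
# Paths assume a source install under /home/user/git.
grep -iE 'error|timeout|fatal' \
  /home/user/git/gitlabhq/log/production.log \
  /home/user/git/gitlabhq/log/unicorn.stderr.log \
  /home/user/git/gitlab-shell/gitlab-shell.log
```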
Output of checks
Results of GitLab application Check
Notes:
- `/home/user/git/repositories` is owned by user `git`, but in a different group, because of how group management works in our organization.
- `/home/user/git/repositories` is `drwxrwx---`. In our installation, everything runs as user `git`, so no setgid bit should be necessary.
- The `/home/user/git/gitlabhq/shared/uploads/` directory has less restrictive permissions than 700; we could certainly fix this, but it doesn't seem relevant to this problem.
- Init script: We use a custom init mechanism in our organization.
- git version: I'm super surprised that we're using our distro's git rather than a more recent one. I'm happy to switch to a git version that's supported by GitLab.
Checking GitLab Shell ...
GitLab Shell version >= 2.7.2 ? ... OK (2.7.2)
Repo base directory exists? ... yes
Repo base directory is a symlink? ... no
Repo base owned by git:git? ... no
User id for git: 2266. Group id for git: group git doesn't exist
Try fixing it:
sudo chown -R git:git /home/user/git/repositories/
For more information see:
doc/install/installation.md in section "GitLab Shell"
Please fix the error above and rerun the checks.
Repo base access is drwxrws---? ... no
Try fixing it:
sudo chmod -R ug+rwX,o-rwx /home/user/git/repositories/
sudo chmod -R ug-s /home/user/git/repositories/
sudo find /home/user/git/repositories/ -type d -print0 | sudo xargs -0 chmod g+s
For more information see:
doc/install/installation.md in section "GitLab Shell"
Please fix the error above and rerun the checks.
hooks directories in repos are links: ...
<removed>
Redis version >= 2.8.0? ... yes
Ruby version >= 2.1.0 ? ... yes (2.1.0)
Your git bin path is "git"
Git version >= 2.7.3 ? ... no
Try fixing it:
Update your git to a version >= 2.7.3 from 1.8.3
Please fix the error above and rerun the checks.
Active users: 1734
Checking GitLab ... Finished
Results of GitLab environment info
System information
System: RedHatEnterpriseServer 6.7
Current User: git
Using RVM: no
Ruby Version: 2.1.0p0
Gem Version: 2.2.0
Bundler Version: 1.5.3
Rake Version: 10.5.0
Sidekiq Version: 4.1.2
GitLab information
Version: 8.8.3
Revision: 62323c3
Directory: /home/user/git/gitlabhq/releases/62323c3b1ddd79fd7643421ee035a7bd5627e713
DB Adapter: postgresql
URL: https://gitlab.example.com
HTTP Clone URL: https://gitlab.example.com/some-group/some-project.git
SSH Clone URL: git@gitlab.example.com:some-group/some-project.git
Using LDAP: yes
Using Omniauth: no
GitLab Shell
Version: 2.7.2
Repositories: /home/user/git/repositories/
Hooks: /home/user/git/gitlab-shell/hooks/
Git: git
Possible fixes
We've changed our HTTP server configuration several times; each change seems to have made the problem better, but none of them has fixed it:
- Initially, we had Apache proxying certain URLs to gitlab-workhorse and the rest to Unicorn, over network sockets. (We had missed a step during one of the GitLab upgrades in which workhorse was intended to become a reverse proxy for everything.) In this configuration, the site almost always went completely down and GitLab needed a restart when the problem occurred. We would see all available connection slots in Apache fill up, and the Unicorn workers hanging in a ppoll of a file handle (the file handle they usually ppoll periodically).
- We changed our configuration so that requests were proxied `apache -> workhorse -> unicorn`, again over network sockets. After this, we still saw the same behavior, but it took longer from when requests started to slow down until the instance was completely unresponsive.
- We switched from Apache to nginx, so that requests were proxied `nginx -> workhorse -> unicorn` over network sockets. We still had the same kind of outages, but we started to see cases where GitLab would become unresponsive and then recover on its own a few minutes later. We suspect that, in the earlier configuration, Apache was deciding the backend was unresponsive and giving up on it.
- We switched to unix domain sockets for communication from nginx to workhorse and from workhorse to unicorn, to match the GitLab recommended configuration as closely as possible (sketched below). Since then, we notice the pre-receive/post-receive symptom in a higher proportion of the unresponsive episodes.
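For concreteness, a sketch of the current topology; the socket paths are placeholders based on the conventional source-install layout, not our exact configuration:

```
# Sketch of the current layout: nginx -> workhorse -> unicorn, all over
# unix domain sockets. Socket paths are placeholders (conventional
# source-install locations).
gitlab-workhorse \
  -listenNetwork unix \
  -listenAddr /home/user/git/gitlabhq/tmp/sockets/gitlab-workhorse.socket \
  -authSocket /home/user/git/gitlabhq/tmp/sockets/gitlab.socket
```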
#13947 (closed) appears to be the closest issue I've found to this behavior.