Bring back a subset of Rugged calls under a feature flag
Recently a customer noticed queued Unicorn workers and increased load after upgrading to GitLab 11.5.3 from 10.8.7:
More importantly, the total number of active and queued Unicorn workers also went up:
Before
After
We observed that many of the processes tended to be `git cat-file` processes waiting in the `D` state (uninterruptible disk sleep). This usually means there is an I/O wait on the NFS server. It also explains why load increased without a corresponding increase in CPU usage: Linux counts processes in uninterruptible sleep toward the load average, in addition to those that are runnable.
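To check whether a node shows the same pattern, a rough sketch like the following can tally processes in the `D` state by command. It assumes input in the shape produced by `ps -e -o stat= -o args=` (state flags first, then the command line). The function name `count_uninterruptible` is made up for illustration and is not part of any GitLab tooling.

```python
from collections import Counter


def count_uninterruptible(ps_output: str) -> Counter:
    """Count processes whose state starts with D, keyed by command.

    Each input line is expected to look like "<stat> <command line>".
    The key is the first two words of the command line, so that
    "git cat-file --batch" and "git cat-file --batch-check" are grouped.
    """
    counts = Counter()
    for line in ps_output.splitlines():
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        stat, args = parts
        if stat.startswith("D"):
            key = " ".join(args.split()[:2])
            counts[key] += 1
    return counts
```

Feeding it the output of `ps -e -o stat= -o args=` and printing `most_common(10)` gives a quick view of which commands are stuck waiting on I/O.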
After applying the merge requests below, which revert these Gitaly RPCs to their Rugged implementations, the system appeared to perform much better. These are the 11.5 ports:
- FindCommit: https://gitlab.com/gitlab-org/gitlab-ee/merge_requests/9377
- GetTreeEntries: https://gitlab.com/gitlab-org/gitlab-ee/merge_requests/9403
- TreeEntry: https://gitlab.com/gitlab-org/gitlab-ee/merge_requests/9404
- CommitIsAncestor: https://gitlab.com/gitlab-org/gitlab-ee/merge_requests/9405
- CommitTreeEntry: https://gitlab.com/gitlab-org/gitlab-ee/merge_requests/9989
- FindDefaultBranchName: https://gitlab.com/gitlab-org/gitlab-ee/merge_requests/9529 (not needed)
11.9 ports:
- https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/25477
- https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/25702
- https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/25706
- https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/25674
- https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/25896
- ListCommitsByOid: https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/27441
We suspect this may be due to a number of reasons:
- Increased I/O due to reloading refs and pack files. For example, loading the merge request widget (e.g. `http://gitlab.example.com/TRYME/test-gitlab-bug2/merge_requests/2.json?serializer=widge`) causes two FindCommit requests to be issued: one for the source branch, and one for the target branch. Previously we could reuse the same `Rugged::Repository` and avoid loading the repo pack file twice.
- N+1 queries introduced by the Gitaly implementation (e.g. https://gitlab.com/gitlab-org/gitlab-ce/issues/57107, https://gitlab.com/gitlab-org/gitlab-ce/issues/57114, https://gitlab.com/gitlab-org/gitlab-ce/issues/57113)
- NFS on spinning disk vs. SSDs. Spinning disk has much lower IOPS, which can slow random I/O accesses.
- `git` home directory mounted on an NFS directory. Any `git` process that runs will read the home `.git/config` and other files, which will slow things down.
- Users hitting the API hard and increased load from `git upload-pack` processes. Today we saw a node with 392 `git upload-pack` processes launched at the beginning of the hour.
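To illustrate the first point, here is a toy sketch of why reusing one repository handle matters. `FakeRepository` is a hypothetical stand-in, not Rugged or Gitaly: its constructor just increments a counter to model the cost of reading refs and pack files from disk. `RequestContext` and both helper functions are likewise made up for this example.

```python
class FakeRepository:
    """Hypothetical repository handle; counts how often it is opened."""

    opens = 0  # class-wide counter standing in for expensive disk reads

    def __init__(self, path):
        FakeRepository.opens += 1  # models loading refs and pack files
        self.path = path

    def find_commit(self, ref):
        return f"{self.path}@{ref}"


class RequestContext:
    """Keeps one handle per repository path for the life of a request."""

    def __init__(self):
        self._repos = {}

    def repository(self, path):
        if path not in self._repos:
            self._repos[path] = FakeRepository(path)
        return self._repos[path]


def widget_commits_without_reuse(path, source, target):
    # Each lookup opens the repository again, as separate calls would.
    return [FakeRepository(path).find_commit(ref) for ref in (source, target)]


def widget_commits_with_reuse(ctx, path, source, target):
    # Both lookups share the handle cached on the request context.
    repo = ctx.repository(path)
    return [repo.find_commit(ref) for ref in (source, target)]
```

In this model, the source and target branch lookups cost two opens without reuse and one open with it. The real cost on NFS depends on caching and pack layout, so this only shows the shape of the problem.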
Until we have fully proven out Gitaly atop NFS, we should have a feature flag that allows use of the Rugged implementations of the aforementioned RPCs.
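As a rough sketch of what such a flag could look like, the dispatch below picks an implementation per RPC and falls back to Gitaly when the flag is off. Everything here is hypothetical: the `FeatureFlags` class, the flag name `rugged_find_commit`, and the two stub functions are placeholders for illustration, not the actual GitLab code.

```python
class FeatureFlags:
    """Hypothetical in-memory flag store."""

    def __init__(self, enabled=()):
        self._enabled = set(enabled)

    def enabled(self, name):
        return name in self._enabled


def rugged_find_commit(repo, ref):
    # Placeholder for the direct-disk (Rugged) code path.
    return ("rugged", repo, ref)


def gitaly_find_commit(repo, ref):
    # Placeholder for the Gitaly RPC code path.
    return ("gitaly", repo, ref)


def find_commit(flags, repo, ref):
    """Use the Rugged path only when its flag is on; default to Gitaly."""
    if flags.enabled("rugged_find_commit"):
        return rugged_find_commit(repo, ref)
    return gitaly_find_commit(repo, ref)
```

Keeping one flag per RPC would let us turn individual Rugged paths back on for affected customers without changing the default behaviour for everyone else.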
/cc: @jacobvosmaer-gitlab, @jwoods06, @lbot, @dblessing, @tcooney