
Bring back a subset of Rugged calls under a feature flag

Recently a customer noticed queued Unicorn workers and increased load after upgrading to GitLab 11.5.3 from 10.8.7:

image

More importantly, the total number of active and queued Unicorn workers also went up:

Before

image

image

After

image

image

We observed that many of the processes were git cat-file processes waiting in the D state (uninterruptible disk sleep), which usually indicates I/O wait on the NFS server. This explains why the load went up without a corresponding increase in CPU usage: the Linux load average counts processes in uninterruptible sleep as well as runnable ones.
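For anyone who wants to check for the same symptom on a Linux node, here is a minimal sketch that counts processes in the D state by reading /proc/<pid>/stat. The helper names `parse_state` and `count_d_state` are made up for this illustration and are not part of GitLab or Gitaly. Note that the command name in the stat file is the executable only, so `git cat-file` shows up as `git`.

```python
import os
from collections import Counter


def parse_state(stat_line):
    """Return (command name, state letter) from one /proc/<pid>/stat line."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or parentheses, so split on the LAST ')' rather than on spaces.
    left, right = stat_line.rsplit(")", 1)
    comm = left.split("(", 1)[1]
    state = right.split()[0]
    return comm, state


def count_d_state(proc_root="/proc"):
    """Count processes currently in state D, grouped by command name."""
    counts = Counter()
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a PID directory
        try:
            with open(os.path.join(proc_root, entry, "stat")) as fh:
                comm, state = parse_state(fh.read())
        except OSError:
            # The process exited between listdir() and open(); skip it.
            continue
        if state == "D":
            counts[comm] += 1
    return counts


if __name__ == "__main__":
    for comm, n in count_d_state().most_common():
        print(f"{n:5d}  {comm}")
```

Running it a few times in a row gives a rough picture of whether git processes are piling up in uninterruptible sleep; a single sample can miss short stalls.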

After we applied merge requests that revert several Gitaly RPCs to their Rugged implementations, the system appeared to perform much better. These are the 11.5 ports:

11.9 ports:

We suspect this may be due to a number of reasons:

  1. Increased I/O due to reloading refs and pack files. For example, loading the merge request widget (e.g. http://gitlab.example.com/TRYME/test-gitlab-bug2/merge_requests/2.json?serializer=widge) causes two FindCommit requests to be issued: one for the source branch, and one for the target branch. Previously we could reuse the same Rugged::Repository and avoid loading the repo pack file twice.
  2. N+1 queries introduced by the Gitaly implementation (e.g. https://gitlab.com/gitlab-org/gitlab-ce/issues/57107, https://gitlab.com/gitlab-org/gitlab-ce/issues/57114, https://gitlab.com/gitlab-org/gitlab-ce/issues/57113)
  3. NFS on spinning disk vs. SSDs. Spinning disk has much lower IOPS, which can slow random I/O accesses.
  4. The git user's home directory is mounted on NFS. Every git process that runs reads its configuration (e.g. ~/.gitconfig) and other files from the home directory, which slows things down.
  5. Users hitting the API hard and increased load from git upload-pack processes. Today we saw a node with 392 git upload-pack processes launched at the beginning of the hour.
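To make the first point concrete, here is a toy model of why reusing one repository handle matters. `FakeRepository` and both `widget_*` functions are hypothetical stand-ins, not Rugged or Gitaly APIs; the `opens` counter simply represents the cost of loading refs and pack files each time a repository is opened.

```python
class FakeRepository:
    """Stand-in for an opened Git repository (hypothetical, for illustration)."""

    opens = 0  # how many times a repository has been opened

    def __init__(self, path):
        # Stands in for reloading refs and pack files over NFS.
        FakeRepository.opens += 1
        self.path = path

    def find_commit(self, revision):
        return f"{self.path}@{revision}"


def widget_without_reuse(path, source, target):
    # Each lookup opens the repository again, like two independent requests.
    return (
        FakeRepository(path).find_commit(source),
        FakeRepository(path).find_commit(target),
    )


def widget_with_reuse(path, source, target):
    # One handle serves both lookups, so the open cost is paid once.
    repo = FakeRepository(path)
    return repo.find_commit(source), repo.find_commit(target)
```

In this model the merge request widget pays the open cost twice without reuse and once with it. On a slow NFS mount that difference is multiplied by every request that looks up more than one ref.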

Until we have fully proven out Gitaly atop NFS, we should have a feature flag that allows use of the Rugged implementations of the aforementioned RPCs.
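As a rough sketch of what such a flag could look like (in Python rather than GitLab's Ruby, and with every name here being an assumption, including the flag name `rugged_find_commit`), the call site picks an implementation per request based on the flag state:

```python
class FeatureFlags:
    """Hypothetical in-memory flag store; a real one would persist its state."""

    def __init__(self):
        self._enabled = set()

    def enable(self, name):
        self._enabled.add(name)

    def disable(self, name):
        self._enabled.discard(name)

    def enabled(self, name):
        return name in self._enabled


def find_commit(repo_path, revision, flags, rugged_impl, gitaly_impl):
    """Dispatch to the Rugged path only when the flag is on."""
    # Default is the Gitaly implementation; the flag is an opt-in escape hatch.
    if flags.enabled("rugged_find_commit"):
        return rugged_impl(repo_path, revision)
    return gitaly_impl(repo_path, revision)
```

Keeping Gitaly as the default means the Rugged path is only used where an administrator has explicitly turned it on, and it can be switched off again without a deploy.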

/cc: @jacobvosmaer-gitlab, @jwoods06, @lbot, @dblessing, @tcooney

Edited by Stan Hu