Memory and CPU cgroups for Gitaly git processes
Problem statement
Currently, any one git repo can saturate a Gitaly node's CPU or memory, causing slowness and timeouts for all other repos stored on that Gitaly node. Ideally we should not allow one repo's workload to starve other colocated repos and impact their availability.
Rationale
For Linux-based GitLab deploys, using cgroups can provide a customizable degree of resource isolation, reducing the scope of saturation-induced failures to approximately match just the repos that are inducing the saturation.
This aims to improve availability during both accidental and malicious spikes in workload. Under normal workload, the cost should be limited to minimal overhead plus a configurable reduction in the resources allotted to each cgroup.
This proposal focuses on two cgroup resource controllers: cpu and memory. On GitLab.com's multi-tenant Gitaly nodes, those two resources often spike in tandem and are the most common resources to saturate. Similar considerations probably also apply to the blkio resource controller for nodes that have a modest disk IOPS capacity and a contended filesystem cache, so consider adding it to the proposal once the sketch here is firmer.
Production incidents
To show this problem is a recurring theme, here are a few examples from the last year:
- gitlab-com/gl-infra/production#2528 (closed) - 2020-08-17 - CPU saturation. Resolved when end-user stopped their highly concurrent usage pattern.
- gitlab-com/gl-infra/production#2457 (closed) - 2020-07-27 - Memory and CPU saturation. Resolved when end-user stopped their highly concurrent usage pattern.
- https://gitlab.com/gitlab-com/gl-infra/production/-/issues/1774 - 2020-03-16 - Memory and CPU saturation. Resolved when end-user stopped their highly concurrent usage pattern.
- gitlab-com/gl-infra/production#1126 (closed) - 2019-09-04 - CPU saturation. Sustained abuse was mitigated via rate-limiting: temporarily reduced the max rate of SSHUploadPack gRPC calls to Gitaly from 120 to 15.
Complementary proposals to improve resource usage efficiency
While the present issue aims to reduce the scope of impact when resource saturation does occur, here are some complementary issues aiming to reduce the probability of reaching saturation.
These issues propose adding optional configurable features to reduce usage of a constrained resource (memory) by spending a more plentiful resource (disk IOPS) or a resource with less severe impact when saturated (CPU).
- https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/11019 - Look for ways to reduce the amount of anonymous memory used by each git-pack-objects process.
- #3003 (closed) - Reduce the concurrency of memory-hungry git-pack-objects processes by shortening their lifespan.
Proposal
At startup time, create a pool of N cgroups (e.g. N=1000), each of which has a budget of P percent of the parent cgroup's budget (e.g. P = 25%). Note that this is intentionally heavily oversubscribed, for reasons explained shortly.
At run time, schedule each git process to one of the N cgroups based on a deterministic hash of some repo metadata (e.g. the project id). This deterministic selection of cgroup is what provides the desired isolation. Hashing based on a repo's unique id gives the most granular grouping, although grouping by namespace may be a good alternative.
Choosing a large value for N spreads git repos thinly among cgroups, so that when one cgroup is saturated, most other repos are unaffected because they are assigned to different cgroups. Effectively, today N=1.
Choosing a moderate value for P constrains each cgroup to use a smaller fraction of the resource (memory, cpu, etc.). Effectively today P=100%. The minimum requirements for any one repo depend on some repo-specific properties (e.g. object count and size distribution) and some runtime properties (e.g. concurrently running git processes). Some of these properties are already constrained (e.g. rate-limiting, max repo size) and some are not. So even with P=100% (90 GB), saturation could occur. Choosing a P smaller than 100% gives other cgroups a chance to survive when one cgroup is saturated.
With P = 25% and N = 1000, when any 1 of the N cgroups is saturated, the other cgroups should continue to be close to their normal memory pressure. 99.9% of repos would remain largely unaffected, still sharing the 75% of memory that remains, whereas the 0.1% of repos in the saturated cgroup would be unavailable as long as the saturation lasts. If a second repo in a different cgroup on that same Gitaly node also saturates its cgroup, then the remaining 50% of memory will be shared by the other 998 cgroups, which depending on workload may start to feel some memory pressure but should still be much more available than without the cgroups' isolation.
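As a rough illustration of the scheme above (a sketch only, assuming cgroups v1 paths, a hypothetical parent cgroup named gitaly, and a 22G per-cgroup limit approximating P = 25% of a 90 GB budget):
# At startup, create the pool of N cgroups, each limited to P percent of the parent's budget.
$ for i in $(seq 0 999); do
>   mkdir -p /sys/fs/cgroup/memory/gitaly/repo-cgroup-$i
>   echo 22G > /sys/fs/cgroup/memory/gitaly/repo-cgroup-$i/memory.limit_in_bytes
> done
# At spawn time, pick a cgroup deterministically from the repo's project id and add the new
# git process to it (project_id and git_pid are placeholders).
$ bucket=$(( $(cksum <<< "$project_id" | cut -d' ' -f1) % 1000 ))
$ echo "$git_pid" > /sys/fs/cgroup/memory/gitaly/repo-cgroup-$bucket/cgroup.procs
Gitaly would perform the equivalent filesystem operations programmatically; the shell form is only meant to show how little machinery the scheme needs.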
Constraints
Tuning the cgroups size and count
Setting P too small can cause processes to fail, either intermittently or consistently.
Practically speaking, to benefit from this scheme, P should probably be between 50% and 10%. It must equate to at least a few gigabytes to support modest concurrency on large repos and bursts of concurrency on small repos.
Sketch of memory saturation behavior:
- If cloning a particular git repo requires more memory than the cgroup currently has free, the kernel will try to reclaim memory (e.g. by evicting filesystem cache pages).
- This can lead to increased physical disk I/O under some conditions. During performance testing, pay attention to changes in disk IOPS and major page fault rate (one way to observe these per cgroup is sketched after this list).
- If enough memory cannot be reclaimed to satisfy the demand, at least one process within the cgroup will be killed by the kernel's out-of-memory killer. This of course leads to intermittent failure.
- If the cgroup's memory limit is too small to satisfy demand even when no other processes are competing within that cgroup, this represents the worst case of consistent failure. This can occur if a repo undergoes rapid growth. It is a risk factor with or without memory cgroups, but since the cgroups have a lower limit than the host's memory budget, it will happen sooner. It can be worked around by growing the cgroup's limit, without adding physical RAM to the host.
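One way to observe this behavior per cgroup (illustrative; these are cgroups v1 memory controller files, and <cgroup> is a placeholder for a cgroup path):
# How many times the cgroup's usage hit its limit.
$ cat /sys/fs/cgroup/memory/<cgroup>/memory.failcnt
# Whether the cgroup is currently under OOM and whether its OOM killer is enabled.
$ cat /sys/fs/cgroup/memory/<cgroup>/memory.oom_control
# Major page faults and page cache usage attributed to the cgroup.
$ grep -w -e 'pgmajfault' -e 'total_cache' /sys/fs/cgroup/memory/<cgroup>/memory.stat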
Since memory pressure can affect any of the processes belonging to a cgroup, scheduling more than one git repo's processes into the same cgroup means they still share a budget. Choosing a large value for N reduces that potential for contention. Dynamically creating a new cgroup on demand for each repo with active processes avoids that risk, but it would also put more accounting burden on the kernel. If we choose to explore that design option, we should first explicitly profile the kernel's overhead when concurrently creating and destroying cgroups many times per second.
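A crude starting point for that profiling could be a single-threaded microbenchmark like the one below (a sketch only; a realistic test would also need concurrency and processes attached to the cgroups):
$ time for i in $(seq 1 1000); do
>   mkdir /sys/fs/cgroup/memory/gitaly-bench-$i
>   rmdir /sys/fs/cgroup/memory/gitaly-bench-$i
> done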
Preserving existing functionality
To state the obvious: Gitaly should work with or without cgroups.
Not all GitLab deployments will want to or be able to use cgroups.
Some environments may already be running Gitaly in a container. The host and container runtime determine whether cgroups v1 or v2 is in use for each cgroup controller, and thus which hierarchy Gitaly should use.
Some environments may not even have the cgroup and/or cgroup2 filesystem mounted. Systemd-based distributions typically mount them respectively under /sys/fs/cgroup/* and /sys/fs/cgroup/unified. But checking /proc/mounts may be prudent.
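For example, a quick (illustrative) check of which cgroup filesystems are mounted:
# The filesystem type field will read cgroup (v1, one mount per controller) or cgroup2 (the unified v2 hierarchy).
$ grep -w -e 'cgroup' -e 'cgroup2' /proc/mounts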
Background: Current limited use of cgroups by Gitaly on GitLab.com
On GitLab.com, Gitaly already runs with itself and all its child processes (git, ruby, etc.) in a single communal cgroup.
# Show the "gitaly" daemon's cgroup list.
$ grep -w -e 'cpu' -e 'memory' -e 'blkio' /proc/$( pidof 'gitaly' )/cgroup
6:blkio:/system.slice/gitlab-runsvdir.service
3:memory:/system.slice/gitlab-runsvdir.service
2:cpu,cpuacct:/system.slice/gitlab-runsvdir.service
# How many processes belong to that cgroup for each controller?
$ wc -l /sys/fs/cgroup/{cpu,memory,blkio}/system.slice/gitlab-runsvdir.service/cgroup.procs
251 /sys/fs/cgroup/cpu/system.slice/gitlab-runsvdir.service/cgroup.procs
251 /sys/fs/cgroup/memory/system.slice/gitlab-runsvdir.service/cgroup.procs
251 /sys/fs/cgroup/blkio/system.slice/gitlab-runsvdir.service/cgroup.procs
753 total
# Summarize those processes.
$ cat /sys/fs/cgroup/cpu/system.slice/gitlab-runsvdir.service/cgroup.procs | xargs -r ps -o comm | sort | uniq -c
1 COMMAND
218 git
1 gitaly
1 gitaly-wrapper
1 gitlab-logrotat
1 git-remote-http
19 ruby
2 runsv
1 runsvdir
1 sleep
2 svlogd
# Show that this cgroup has no nested child cgroups.
$ find /sys/fs/cgroup/cpu/system.slice/gitlab-runsvdir.service/ -mindepth 1 -type d | wc -l
0
This cgroup is set up for Gitaly by the systemd unit gitlab-runsvdir.service. The MemoryLimit and CPUShares directives that configure the memory and cpu cgroups come from a Chef-managed supplemental "drop-in" config file (override.conf):
$ systemctl cat gitlab-runsvdir.service
# /usr/lib/systemd/system/gitlab-runsvdir.service
[Unit]
Description=GitLab Runit supervision process
After=multi-user.target
[Service]
ExecStart=/opt/gitlab/embedded/bin/runsvdir-start
Restart=always
TasksMax=4915
[Install]
WantedBy=multi-user.target
# /etc/systemd/system/gitlab-runsvdir.service.d/override.conf
[Service]
CPUShares=1024
MemoryLimit=90G
TasksMax=16384
The base systemd unit file is generated by the gitlab omnibus package's cookbook recipe runit_systemd:
/opt/gitlab/embedded/cookbooks/package/recipes/runit_systemd.rb
/opt/gitlab/embedded/cookbooks/package/templates/default/gitlab-runsvdir.service.erb
That recipe could be trivially extended to optionally include systemd directives for either cgroups v1 (MemoryLimit and CPUShares) or cgroups v2 (MemoryMax and CPUWeight).
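For illustration, a cgroups v2 equivalent of the override.conf drop-in above might look roughly like this (values are examples only; CPUWeight=100 is systemd's default and is roughly the counterpart of CPUShares=1024):
[Service]
CPUWeight=100
MemoryMax=90G
TasksMax=16384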
Doing so is not a prerequisite to adding a Gitaly feature for creating child cgroups under whatever cgroup Gitaly itself is running in. Whether the gitaly daemon is a member of the root cgroup or some other more constrained cgroup, Gitaly itself can attempt to create child cgroups and add processes to them. If creating the cgroup fails, Gitaly can continue as normal, allowing its child processes to implicitly run in the current cgroup, as happens today.
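As a sketch of that fallback behavior (illustrative shell only, assuming cgroups v1; Gitaly would do the equivalent in code):
# Find the memory cgroup the current process belongs to.
$ parent=$(awk -F: '$2 == "memory" {print $3}' /proc/self/cgroup)
# Try to create a child cgroup under it; if that fails (read-only mount, missing permissions, etc.),
# carry on without isolation, as happens today.
$ mkdir "/sys/fs/cgroup/memory${parent}/gitaly-repos" 2>/dev/null || echo 'cgroup creation failed; continuing without isolation'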