Investigate why some gitaly git processes are not assigned to cgroups

Problem summary

When Gitaly spawns child processes such as git or gitaly-hooks, it normally assigns the new process to a per-repo cgroup.

These cgroups provide per-project limits on memory and CPU consumption.

Today I unexpectedly found that file-cny-01 intermittently has a large minority of its git processes running outside of the Gitaly-managed cgroup pool. Instead, they run in the same cgroup as gitaly itself. (See example below.)

This implies Gitaly failed to explicitly assign these processes to cgroups when they were created.

As I recall, one of our design decisions was to allow process creation to continue if something went wrong with cgroup assignment, so that any corner cases would not cause gRPC failures. So this is only a practical problem if many processes end up running unconfined (since that would thwart the cgroup resource limits that provide fair insulation between projects).

This issue is to investigate why many git processes are intermittently not being assigned to gitaly-managed cgroups. This was discovered on file-cny-01, but it presumably can occur on other Gitaly nodes too.

Why does this matter?

This can increase the likelihood of Gitaly saturation incidents, since unconfined processes bypass the cgroup protections.

The cgroups mechanism aims to prevent any one repo from starving all others on a Gitaly node for CPU or memory. Any processes not assigned to a per-repo cgroup are evading that safety mechanism and implicitly run with no limit (as they did before we implemented cgroups).

The Gitaly-related incident rate has dropped significantly since we rolled out cgroups, and they have proven to be an effective mitigation. Closing this gap would increase their effectiveness.

Example

Gitaly's config.toml configures Gitaly to create and manage 1000 CPU cgroups under /sys/fs/cgroup/cpu/gitaly/gitaly-[pid]/repos-{0..999}.

msmiley@file-cny-01-stor-gprd.c.gitlab-production.internal:~$ sudo cat /var/opt/gitlab/gitaly/config.toml | less
...
[cgroups]
mountpoint = "/sys/fs/cgroup"
hierarchy_root = "gitaly"
memory_bytes = 96636764160
cpu_shares = 1024

  [cgroups.repositories]
  count = 1000
  memory_bytes = 64424509440
  cpu_shares = 512
msmiley@file-cny-01-stor-gprd.c.gitlab-production.internal:~$ sudo find /sys/fs/cgroup/cpu/gitaly/ -mindepth 2 -maxdepth 2 -type d | wc -l
1000

msmiley@file-cny-01-stor-gprd.c.gitlab-production.internal:~$ sudo find /sys/fs/cgroup/cpu/gitaly/ -mindepth 2 -maxdepth 2 -type d | sort -V | tail -n5
/sys/fs/cgroup/cpu/gitaly/gitaly-4004643/repos-995
/sys/fs/cgroup/cpu/gitaly/gitaly-4004643/repos-996
/sys/fs/cgroup/cpu/gitaly/gitaly-4004643/repos-997
/sys/fs/cgroup/cpu/gitaly/gitaly-4004643/repos-998
/sys/fs/cgroup/cpu/gitaly/gitaly-4004643/repos-999
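
For readability, the memory_bytes values in the config above can be converted to GiB. This is plain shell arithmetic, nothing Gitaly-specific, and the helper name to_gib is made up for this note:

```shell
# Convert a byte count to GiB (1 GiB = 1024^3 bytes).
# to_gib is a hypothetical helper name, not part of Gitaly.
to_gib() {
  echo $(( $1 / 1024 / 1024 / 1024 ))
}

to_gib 96636764160   # [cgroups] memory_bytes              -> 90
to_gib 64424509440   # [cgroups.repositories] memory_bytes -> 60
```

So the parent cgroup is capped at 90 GiB, and each per-repo cgroup at 60 GiB.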

When Gitaly spawns a new child process (e.g. git), it normally assigns that process to one of those 1000 cgroups, choosing one based on a hash of the repo's path.
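
As a rough illustration of that hash-based selection, the sketch below maps a repo path to one of the 1000 buckets. It uses the POSIX cksum CRC and the raw repo path as the hash key; both are assumptions for illustration only, since the exact hash function and key that Gitaly uses are not stated here. The function name repo_cgroup_index is hypothetical.

```shell
# Hypothetical helper: map a repo path to a cgroup index in [0, count).
# ASSUMPTION: cksum's CRC stands in for whatever hash Gitaly actually uses.
repo_cgroup_index() {
  rci_path=$1
  rci_count=${2:-1000}
  # cksum prints "<crc> <byte count>" when reading stdin; keep only the CRC.
  rci_crc=$(printf '%s' "$rci_path" | cksum | cut -d' ' -f1)
  echo $(( rci_crc % rci_count ))
}

# Prints an index between 0 and 999; the same path always yields the same index.
repo_cgroup_index "@hashed/2b/25/2b2517a9dad97e502671d96f4d5c5d0ecba17476beddacf1c70251481304388e.git"
```

Because the mapping is deterministic, all processes for a given repo land in the same cgroup. With only 1000 buckets, unrelated repos can also collide and share one.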

The following output surveys all running git processes and shows which cgroup they are assigned to:

msmiley@file-cny-01-stor-gprd.c.gitlab-production.internal:~$ date ; pgrep -x git | xargs -i grep -w cpu /proc/{}/cgroup 2> /dev/null | sort | uniq -c
Wed 08 Mar 2023 07:12:11 PM UTC
    197 10:cpu,cpuacct:/gitaly/gitaly-4004643/repos-173
      2 10:cpu,cpuacct:/gitaly/gitaly-4004643/repos-81
      2 10:cpu,cpuacct:/gitaly/gitaly-4004643/repos-899
      2 10:cpu,cpuacct:/gitaly/gitaly-4004643/repos-913
     42 10:cpu,cpuacct:/system.slice/gitlab-runsvdir.service

The high count for cgroup repos-173 is expected, because the canary Gitaly node includes one repo that is typically much more active than all others. Consequently, its cgroup tends to have many more processes than the others.

However, we do not expect to have a high count of git processes in cgroup /system.slice/gitlab-runsvdir.service. This is the subject of our investigation.
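
To make this easier to track over time, the survey above can be reduced to a managed/unmanaged tally. The sketch below assumes the cgroup v1 layout shown in this issue (a single cpu,cpuacct line per process under /proc/PID/cgroup); the function names are made up for this note.

```shell
# Classify one cpu-controller line from /proc/<pid>/cgroup.
# Prints "managed" for Gitaly's per-repo cgroups, "unmanaged" otherwise.
classify_cpu_cgroup() {
  case "$1" in
    *:/gitaly/gitaly-*/repos-[0-9]*) echo managed ;;
    *) echo unmanaged ;;
  esac
}

# Tally all running git processes by class.
# Processes that exit mid-scan (unreadable /proc entry) are skipped.
survey_git_cgroups() {
  for pid in $(pgrep -x git); do
    line=$(grep -w cpu "/proc/$pid/cgroup" 2>/dev/null) || continue
    classify_cpu_cgroup "$line"
  done | sort | uniq -c
}

classify_cpu_cgroup '10:cpu,cpuacct:/gitaly/gitaly-4004643/repos-173'        # managed
classify_cpu_cgroup '10:cpu,cpuacct:/system.slice/gitlab-runsvdir.service'   # unmanaged
```

Running survey_git_cgroups on a Gitaly node prints one count per class, which could be sampled periodically to see how often the unmanaged count is nonzero.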

Those git processes seem like they should have been assigned to per-repo cgroups. Here are a few examples of those processes (the first 10 out of the 42 processes listed above). Notes on these processes:

  • They represent several distinct repo paths, not just one.
  • The current gitaly process (PID 4004643) is their parent, so they were spawned by gitaly.
  • Their process age varies, so they are not just extremely young processes.
  • At least two distinct git subcommands are represented: pack-objects and upload-pack.

msmiley@file-cny-01-stor-gprd.c.gitlab-production.internal:~$ date ; for PID in $( pgrep -x git ) ; do grep -q 'cpu,cpuacct:/system.slice/gitlab-runsvdir.service' /proc/$PID/cgroup 2> /dev/null && echo $PID ; done | xargs -r ps -fww
Wed 08 Mar 2023 07:12:26 PM UTC
UID          PID    PPID  C STIME TTY      STAT   TIME CMD
git      3068450 4004643  0 18:28 ?        S      0:05 /var/opt/gitlab/gitaly/run/gitaly-4004643/git-exec-1607399968.d/git --git-dir /var/opt/gitlab/git-data/repositories/@hashed/2b/25/2b2517a9dad97e502671d96f4d5c5d0ecba17476beddacf1c70251481304388e.git -c gc.auto=0 -c core.autocrlf=input -c core.useReplaceRefs=false -c core.fsync=objects,derived-metadata,reference -c core.fsyncMethod=fsync -c uploadpack.allowFilter=true -c uploadpack.allowAnySHA1InWant=true -c uploadpack.hideRefs=refs/remotes/ -c uploadpack.hideRefs=refs/tmp/ -c uploadpack.hideRefs=refs/keep-around/ -c pack.windowMemory=100m -c pack.writeReverseIndex=true -c pack.threads=5 -c core.hooksPath=/var/opt/gitlab/gitaly/run/gitaly-4004643/hooks-725560078.d -c uploadpack.packObjectsHook=/var/opt/gitlab/gitaly/run/gitaly-4004643/gitaly-hooks upload-pack --stateless-rpc --end-of-options /var/opt/gitlab/git-data/repositories/@hashed/2b/25/2b2517a9dad97e502671d96f4d5c5d0ecba17476beddacf1c70251481304388e.git
git      3068988 4004643  2 18:28 ?        S      0:59 /var/opt/gitlab/gitaly/run/gitaly-4004643/git-exec-1607399968.d/git --git-dir /var/opt/gitlab/git-data/repositories/@hashed/2b/25/2b2517a9dad97e502671d96f4d5c5d0ecba17476beddacf1c70251481304388e.git -c gc.auto=0 -c core.autocrlf=input -c core.useReplaceRefs=false -c core.fsync=objects,derived-metadata,reference -c core.fsyncMethod=fsync -c pack.windowMemory=100m -c pack.writeReverseIndex=true -c pack.threads=5 pack-objects --stdout --revs --thin --delta-base-offset --end-of-options
git      3127985 4004643  0 18:35 ?        S      0:01 /var/opt/gitlab/gitaly/run/gitaly-4004643/git-exec-1607399968.d/git --git-dir /var/opt/gitlab/git-data/repositories/@hashed/0d/b6/0db61c6f2b4da90400de7684806e5ab4e0cb84a866bf87cf5ca61a832f01bd98.git -c gc.auto=0 -c core.autocrlf=input -c core.useReplaceRefs=false -c core.fsync=objects,derived-metadata,reference -c core.fsyncMethod=fsync -c uploadpack.allowFilter=true -c uploadpack.allowAnySHA1InWant=true -c uploadpack.hideRefs=refs/remotes/ -c uploadpack.hideRefs=refs/tmp/ -c uploadpack.hideRefs=refs/keep-around/ -c pack.windowMemory=100m -c pack.writeReverseIndex=true -c pack.threads=5 -c uploadpack.allowFilter=true -c uploadpack.allowAnySHA1InWant=true -c receive.maxInputSize=5242880000 -c core.hooksPath=/var/opt/gitlab/gitaly/run/gitaly-4004643/hooks-725560078.d -c uploadpack.packObjectsHook=/var/opt/gitlab/gitaly/run/gitaly-4004643/gitaly-hooks upload-pack --end-of-options /var/opt/gitlab/git-data/repositories/@hashed/0d/b6/0db61c6f2b4da90400de7684806e5ab4e0cb84a866bf87cf5ca61a832f01bd98.git
git      3130053 4004643  0 18:35 ?        S      0:19 /var/opt/gitlab/gitaly/run/gitaly-4004643/git-exec-1607399968.d/git --git-dir /var/opt/gitlab/git-data/repositories/@hashed/0d/b6/0db61c6f2b4da90400de7684806e5ab4e0cb84a866bf87cf5ca61a832f01bd98.git -c gc.auto=0 -c core.autocrlf=input -c core.useReplaceRefs=false -c core.fsync=objects,derived-metadata,reference -c core.fsyncMethod=fsync -c pack.windowMemory=100m -c pack.writeReverseIndex=true -c pack.threads=5 pack-objects --stdout --revs --thin --progress --delta-base-offset --end-of-options
git      3137719 4004643  0 18:36 ?        S      0:00 /var/opt/gitlab/gitaly/run/gitaly-4004643/git-exec-1607399968.d/git --git-dir /var/opt/gitlab/git-data/repositories/@hashed/66/5d/665d5447adfbca31d7b1e6aaf3e806c16ccc8265f89fa43c4e316be2615c4bb2.git -c gc.auto=0 -c core.autocrlf=input -c core.useReplaceRefs=false -c core.fsync=objects,derived-metadata,reference -c core.fsyncMethod=fsync -c uploadpack.allowFilter=true -c uploadpack.allowAnySHA1InWant=true -c uploadpack.hideRefs=refs/remotes/ -c uploadpack.hideRefs=refs/tmp/ -c uploadpack.hideRefs=refs/keep-around/ -c pack.windowMemory=100m -c pack.writeReverseIndex=true -c pack.threads=5 -c core.hooksPath=/var/opt/gitlab/gitaly/run/gitaly-4004643/hooks-725560078.d -c uploadpack.packObjectsHook=/var/opt/gitlab/gitaly/run/gitaly-4004643/gitaly-hooks upload-pack --stateless-rpc --end-of-options /var/opt/gitlab/git-data/repositories/@hashed/66/5d/665d5447adfbca31d7b1e6aaf3e806c16ccc8265f89fa43c4e316be2615c4bb2.git
git      3137756 4004643  0 18:36 ?        S      0:18 /var/opt/gitlab/gitaly/run/gitaly-4004643/git-exec-1607399968.d/git --git-dir /var/opt/gitlab/git-data/repositories/@hashed/66/5d/665d5447adfbca31d7b1e6aaf3e806c16ccc8265f89fa43c4e316be2615c4bb2.git -c gc.auto=0 -c core.autocrlf=input -c core.useReplaceRefs=false -c core.fsync=objects,derived-metadata,reference -c core.fsyncMethod=fsync -c pack.windowMemory=100m -c pack.writeReverseIndex=true -c pack.threads=5 pack-objects --stdout --revs --thin --delta-base-offset --end-of-options
git      3139794 4004643  0 18:37 ?        S      0:00 /var/opt/gitlab/gitaly/run/gitaly-4004643/git-exec-1607399968.d/git --git-dir /var/opt/gitlab/git-data/repositories/@hashed/62/7f/627fe5175ddbb7378afaa1dfcdb18083762131cbe3d28e85d9a92b531854067f.git -c gc.auto=0 -c core.autocrlf=input -c core.useReplaceRefs=false -c core.fsync=objects,derived-metadata,reference -c core.fsyncMethod=fsync -c uploadpack.allowFilter=true -c uploadpack.allowAnySHA1InWant=true -c uploadpack.hideRefs=refs/remotes/ -c uploadpack.hideRefs=refs/tmp/ -c uploadpack.hideRefs=refs/keep-around/ -c pack.windowMemory=100m -c pack.writeReverseIndex=true -c pack.threads=5 -c core.hooksPath=/var/opt/gitlab/gitaly/run/gitaly-4004643/hooks-725560078.d -c uploadpack.packObjectsHook=/var/opt/gitlab/gitaly/run/gitaly-4004643/gitaly-hooks upload-pack --stateless-rpc --end-of-options /var/opt/gitlab/git-data/repositories/@hashed/62/7f/627fe5175ddbb7378afaa1dfcdb18083762131cbe3d28e85d9a92b531854067f.git
git      3139844 4004643  0 18:37 ?        S      0:18 /var/opt/gitlab/gitaly/run/gitaly-4004643/git-exec-1607399968.d/git --git-dir /var/opt/gitlab/git-data/repositories/@hashed/62/7f/627fe5175ddbb7378afaa1dfcdb18083762131cbe3d28e85d9a92b531854067f.git -c gc.auto=0 -c core.autocrlf=input -c core.useReplaceRefs=false -c core.fsync=objects,derived-metadata,reference -c core.fsyncMethod=fsync -c pack.windowMemory=100m -c pack.writeReverseIndex=true -c pack.threads=5 pack-objects --stdout --revs --thin --delta-base-offset --end-of-options
git      3143621 4004643  0 18:37 ?        S      0:00 /var/opt/gitlab/gitaly/run/gitaly-4004643/git-exec-1607399968.d/git --git-dir /var/opt/gitlab/git-data/repositories/@hashed/ff/1b/ff1bb16b67bd08e8e5b8b27cdcaced894139c993c77b2330fb431494bd1a84c5.git -c gc.auto=0 -c core.autocrlf=input -c core.useReplaceRefs=false -c core.fsync=objects,derived-metadata,reference -c core.fsyncMethod=fsync -c uploadpack.allowFilter=true -c uploadpack.allowAnySHA1InWant=true -c uploadpack.hideRefs=refs/remotes/ -c uploadpack.hideRefs=refs/tmp/ -c uploadpack.hideRefs=refs/keep-around/ -c pack.windowMemory=100m -c pack.writeReverseIndex=true -c pack.threads=5 -c core.hooksPath=/var/opt/gitlab/gitaly/run/gitaly-4004643/hooks-725560078.d -c uploadpack.packObjectsHook=/var/opt/gitlab/gitaly/run/gitaly-4004643/gitaly-hooks upload-pack --stateless-rpc --end-of-options /var/opt/gitlab/git-data/repositories/@hashed/ff/1b/ff1bb16b67bd08e8e5b8b27cdcaced894139c993c77b2330fb431494bd1a84c5.git
git      3143676 4004643  1 18:37 ?        S      0:21 /var/opt/gitlab/gitaly/run/gitaly-4004643/git-exec-1607399968.d/git --git-dir /var/opt/gitlab/git-data/repositories/@hashed/ff/1b/ff1bb16b67bd08e8e5b8b27cdcaced894139c993c77b2330fb431494bd1a84c5.git -c gc.auto=0 -c core.autocrlf=input -c core.useReplaceRefs=false -c core.fsync=objects,derived-metadata,reference -c core.fsyncMethod=fsync -c pack.windowMemory=100m -c pack.writeReverseIndex=true -c pack.threads=5 pack-objects --stdout --revs --thin --delta-base-offset --end-of-options