Investigate why some gitaly git processes are not assigned to cgroups
Problem summary
When Gitaly spawns child processes such as git or gitaly-hooks, it normally assigns the new process to a per-repo cgroup.
These cgroups provide per-project limits on memory and CPU consumption.
Today I unexpectedly found that file-cny-01 intermittently has a large minority of its git processes running outside of the Gitaly-managed cgroup pool. Instead, they were running in the same cgroup as gitaly itself. (See example below.)
This implies Gitaly failed to explicitly assign these processes to cgroups when they were created.
As I recall, one of our design decisions was to allow process creation to continue if something went wrong with cgroup assignment, so that any corner cases would not cause gRPC failures. So this is only a practical problem if many processes end up running unconfined (since that would thwart the cgroup resource limits that provide fair insulation between projects).
This issue is to investigate why many git processes are intermittently not being assigned to Gitaly-managed cgroups. This was discovered on file-cny-01, but it presumably can occur on other Gitaly nodes too.
Why does this matter?
This can potentially increase the likelihood of Gitaly saturation incidents, since it bypasses the cgroups protections.
The cgroups mechanism aims to prevent any one repo from starving all others on a Gitaly node for CPU or memory. Any processes not assigned to a per-repo cgroup are evading that safety mechanism and implicitly run with no limit (as they did before we implemented cgroups).
The Gitaly-related incident rate has dropped significantly since we rolled out cgroups, and cgroups have proven to be an effective mitigation. Closing this gap would increase their effectiveness.
Example
Gitaly's config.toml configures Gitaly to create and manage 1000 CPU cgroups under /sys/fs/cgroup/cpu/gitaly/gitaly-[pid]/repos-{0..999}.
msmiley@file-cny-01-stor-gprd.c.gitlab-production.internal:~$ sudo cat /var/opt/gitlab/gitaly/config.toml | less
...
[cgroups]
mountpoint = "/sys/fs/cgroup"
hierarchy_root = "gitaly"
memory_bytes = 96636764160
cpu_shares = 1024
[cgroups.repositories]
count = 1000
memory_bytes = 64424509440
cpu_shares = 512
msmiley@file-cny-01-stor-gprd.c.gitlab-production.internal:~$ sudo find /sys/fs/cgroup/cpu/gitaly/ -mindepth 2 -maxdepth 2 -type d | wc -l
1000
msmiley@file-cny-01-stor-gprd.c.gitlab-production.internal:~$ sudo find /sys/fs/cgroup/cpu/gitaly/ -mindepth 2 -maxdepth 2 -type d | sort -V | tail -n5
/sys/fs/cgroup/cpu/gitaly/gitaly-4004643/repos-995
/sys/fs/cgroup/cpu/gitaly/gitaly-4004643/repos-996
/sys/fs/cgroup/cpu/gitaly/gitaly-4004643/repos-997
/sys/fs/cgroup/cpu/gitaly/gitaly-4004643/repos-998
/sys/fs/cgroup/cpu/gitaly/gitaly-4004643/repos-999
When Gitaly spawns a new child process (e.g. git), it normally assigns that process to one of those 1000 cgroups, choosing one based on a hash of the repo's path.
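As a rough illustration of that hash-based assignment, the sketch below maps a repo path to a bucket index in the range 0..999. The function name repo_cgroup_index is hypothetical, and CRC-32 from POSIX cksum is only a stand-in: this issue does not say which hash function or which exact path string Gitaly uses, so the index printed here will not necessarily match the real repos-N cgroup for a given repo.

```shell
# Hypothetical helper: map a repo path to one of N cgroup buckets.
# ASSUMPTION: CRC-32 (via POSIX cksum) stands in for Gitaly's real hash,
# which is not specified in this issue.
repo_cgroup_index() {
    repo_path=$1
    bucket_count=${2:-1000}   # matches "count = 1000" in [cgroups.repositories]
    # cksum prints "<checksum> <byte count>" for stdin; keep the checksum only.
    checksum=$(printf '%s' "$repo_path" | cksum | cut -d ' ' -f 1)
    echo $(( checksum % bucket_count ))
}
```

The same path always yields the same index, which is why a single very active repo concentrates all of its processes in one cgroup.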
The following output surveys all running git processes and shows which cgroup they are assigned to:
msmiley@file-cny-01-stor-gprd.c.gitlab-production.internal:~$ date ; pgrep -x git | xargs -i grep -w cpu /proc/{}/cgroup 2> /dev/null | sort | uniq -c
Wed 08 Mar 2023 07:12:11 PM UTC
197 10:cpu,cpuacct:/gitaly/gitaly-4004643/repos-173
2 10:cpu,cpuacct:/gitaly/gitaly-4004643/repos-81
2 10:cpu,cpuacct:/gitaly/gitaly-4004643/repos-899
2 10:cpu,cpuacct:/gitaly/gitaly-4004643/repos-913
42 10:cpu,cpuacct:/system.slice/gitlab-runsvdir.service
The high count for cgroup repos-173 is expected, because the canary Gitaly node includes one repo that is typically much more active than all others. Consequently, its cgroup tends to have many more processes than the others.
However, we do not expect to have a high count of git processes in cgroup /system.slice/gitlab-runsvdir.service. This is the subject of our investigation.
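To track how large the unconfined share is over time, the survey can be reduced to two numbers. The sketch below is a hypothetical helper (summarize_cgroup_membership is not an existing tool) that reads cgroup v1 lines in the format shown above on stdin and counts those inside versus outside the per-repo pool. It assumes the hierarchy root is gitaly, as in the config.toml above, and that per-repo cgroups are named repos-N.

```shell
# Hypothetical helper: count git processes inside vs. outside the per-repo pool.
# Input: one line per process, e.g. "10:cpu,cpuacct:/gitaly/gitaly-4004643/repos-173"
# ASSUMPTION: hierarchy_root is "gitaly" and per-repo cgroups are named repos-N.
summarize_cgroup_membership() {
    awk -F: '
        NF < 3 { next }   # ignore blank or malformed lines
        $3 ~ /^\/gitaly\/gitaly-[0-9]+\/repos-[0-9]+$/ { managed++; next }
        { unmanaged++ }
        END { printf "managed=%d unmanaged=%d\n", managed, unmanaged }
    '
}

# Example invocation (not run here):
#   pgrep -x git | xargs -I{} grep -w cpu /proc/{}/cgroup 2> /dev/null \
#     | summarize_cgroup_membership
```

Given the per-process lines behind the survey above, this would report managed=203 unmanaged=42 for that snapshot.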
Those git processes seem like they should have been assigned to per-repo cgroups. Here are a few examples of those processes (the first 10 out of the 42 processes listed above). Notes on these processes:
- They represent several distinct repo paths, not just one.
- The current gitaly process (PID 4004643) is their parent, so they were spawned by gitaly.
- Their process age varies, so they are not just extremely young processes.
- At least two distinct git subcommands are represented: pack-objects and upload-pack
msmiley@file-cny-01-stor-gprd.c.gitlab-production.internal:~$ date ; for PID in $( pgrep -x git ) ; do grep -q 'cpu,cpuacct:/system.slice/gitlab-runsvdir.service' /proc/$PID/cgroup 2> /dev/null && echo $PID ; done | xargs -r ps -fww
Wed 08 Mar 2023 07:12:26 PM UTC
UID PID PPID C STIME TTY STAT TIME CMD
git 3068450 4004643 0 18:28 ? S 0:05 /var/opt/gitlab/gitaly/run/gitaly-4004643/git-exec-1607399968.d/git --git-dir /var/opt/gitlab/git-data/repositories/@hashed/2b/25/2b2517a9dad97e502671d96f4d5c5d0ecba17476beddacf1c70251481304388e.git -c gc.auto=0 -c core.autocrlf=input -c core.useReplaceRefs=false -c core.fsync=objects,derived-metadata,reference -c core.fsyncMethod=fsync -c uploadpack.allowFilter=true -c uploadpack.allowAnySHA1InWant=true -c uploadpack.hideRefs=refs/remotes/ -c uploadpack.hideRefs=refs/tmp/ -c uploadpack.hideRefs=refs/keep-around/ -c pack.windowMemory=100m -c pack.writeReverseIndex=true -c pack.threads=5 -c core.hooksPath=/var/opt/gitlab/gitaly/run/gitaly-4004643/hooks-725560078.d -c uploadpack.packObjectsHook=/var/opt/gitlab/gitaly/run/gitaly-4004643/gitaly-hooks upload-pack --stateless-rpc --end-of-options /var/opt/gitlab/git-data/repositories/@hashed/2b/25/2b2517a9dad97e502671d96f4d5c5d0ecba17476beddacf1c70251481304388e.git
git 3068988 4004643 2 18:28 ? S 0:59 /var/opt/gitlab/gitaly/run/gitaly-4004643/git-exec-1607399968.d/git --git-dir /var/opt/gitlab/git-data/repositories/@hashed/2b/25/2b2517a9dad97e502671d96f4d5c5d0ecba17476beddacf1c70251481304388e.git -c gc.auto=0 -c core.autocrlf=input -c core.useReplaceRefs=false -c core.fsync=objects,derived-metadata,reference -c core.fsyncMethod=fsync -c pack.windowMemory=100m -c pack.writeReverseIndex=true -c pack.threads=5 pack-objects --stdout --revs --thin --delta-base-offset --end-of-options
git 3127985 4004643 0 18:35 ? S 0:01 /var/opt/gitlab/gitaly/run/gitaly-4004643/git-exec-1607399968.d/git --git-dir /var/opt/gitlab/git-data/repositories/@hashed/0d/b6/0db61c6f2b4da90400de7684806e5ab4e0cb84a866bf87cf5ca61a832f01bd98.git -c gc.auto=0 -c core.autocrlf=input -c core.useReplaceRefs=false -c core.fsync=objects,derived-metadata,reference -c core.fsyncMethod=fsync -c uploadpack.allowFilter=true -c uploadpack.allowAnySHA1InWant=true -c uploadpack.hideRefs=refs/remotes/ -c uploadpack.hideRefs=refs/tmp/ -c uploadpack.hideRefs=refs/keep-around/ -c pack.windowMemory=100m -c pack.writeReverseIndex=true -c pack.threads=5 -c uploadpack.allowFilter=true -c uploadpack.allowAnySHA1InWant=true -c receive.maxInputSize=5242880000 -c core.hooksPath=/var/opt/gitlab/gitaly/run/gitaly-4004643/hooks-725560078.d -c uploadpack.packObjectsHook=/var/opt/gitlab/gitaly/run/gitaly-4004643/gitaly-hooks upload-pack --end-of-options /var/opt/gitlab/git-data/repositories/@hashed/0d/b6/0db61c6f2b4da90400de7684806e5ab4e0cb84a866bf87cf5ca61a832f01bd98.git
git 3130053 4004643 0 18:35 ? S 0:19 /var/opt/gitlab/gitaly/run/gitaly-4004643/git-exec-1607399968.d/git --git-dir /var/opt/gitlab/git-data/repositories/@hashed/0d/b6/0db61c6f2b4da90400de7684806e5ab4e0cb84a866bf87cf5ca61a832f01bd98.git -c gc.auto=0 -c core.autocrlf=input -c core.useReplaceRefs=false -c core.fsync=objects,derived-metadata,reference -c core.fsyncMethod=fsync -c pack.windowMemory=100m -c pack.writeReverseIndex=true -c pack.threads=5 pack-objects --stdout --revs --thin --progress --delta-base-offset --end-of-options
git 3137719 4004643 0 18:36 ? S 0:00 /var/opt/gitlab/gitaly/run/gitaly-4004643/git-exec-1607399968.d/git --git-dir /var/opt/gitlab/git-data/repositories/@hashed/66/5d/665d5447adfbca31d7b1e6aaf3e806c16ccc8265f89fa43c4e316be2615c4bb2.git -c gc.auto=0 -c core.autocrlf=input -c core.useReplaceRefs=false -c core.fsync=objects,derived-metadata,reference -c core.fsyncMethod=fsync -c uploadpack.allowFilter=true -c uploadpack.allowAnySHA1InWant=true -c uploadpack.hideRefs=refs/remotes/ -c uploadpack.hideRefs=refs/tmp/ -c uploadpack.hideRefs=refs/keep-around/ -c pack.windowMemory=100m -c pack.writeReverseIndex=true -c pack.threads=5 -c core.hooksPath=/var/opt/gitlab/gitaly/run/gitaly-4004643/hooks-725560078.d -c uploadpack.packObjectsHook=/var/opt/gitlab/gitaly/run/gitaly-4004643/gitaly-hooks upload-pack --stateless-rpc --end-of-options /var/opt/gitlab/git-data/repositories/@hashed/66/5d/665d5447adfbca31d7b1e6aaf3e806c16ccc8265f89fa43c4e316be2615c4bb2.git
git 3137756 4004643 0 18:36 ? S 0:18 /var/opt/gitlab/gitaly/run/gitaly-4004643/git-exec-1607399968.d/git --git-dir /var/opt/gitlab/git-data/repositories/@hashed/66/5d/665d5447adfbca31d7b1e6aaf3e806c16ccc8265f89fa43c4e316be2615c4bb2.git -c gc.auto=0 -c core.autocrlf=input -c core.useReplaceRefs=false -c core.fsync=objects,derived-metadata,reference -c core.fsyncMethod=fsync -c pack.windowMemory=100m -c pack.writeReverseIndex=true -c pack.threads=5 pack-objects --stdout --revs --thin --delta-base-offset --end-of-options
git 3139794 4004643 0 18:37 ? S 0:00 /var/opt/gitlab/gitaly/run/gitaly-4004643/git-exec-1607399968.d/git --git-dir /var/opt/gitlab/git-data/repositories/@hashed/62/7f/627fe5175ddbb7378afaa1dfcdb18083762131cbe3d28e85d9a92b531854067f.git -c gc.auto=0 -c core.autocrlf=input -c core.useReplaceRefs=false -c core.fsync=objects,derived-metadata,reference -c core.fsyncMethod=fsync -c uploadpack.allowFilter=true -c uploadpack.allowAnySHA1InWant=true -c uploadpack.hideRefs=refs/remotes/ -c uploadpack.hideRefs=refs/tmp/ -c uploadpack.hideRefs=refs/keep-around/ -c pack.windowMemory=100m -c pack.writeReverseIndex=true -c pack.threads=5 -c core.hooksPath=/var/opt/gitlab/gitaly/run/gitaly-4004643/hooks-725560078.d -c uploadpack.packObjectsHook=/var/opt/gitlab/gitaly/run/gitaly-4004643/gitaly-hooks upload-pack --stateless-rpc --end-of-options /var/opt/gitlab/git-data/repositories/@hashed/62/7f/627fe5175ddbb7378afaa1dfcdb18083762131cbe3d28e85d9a92b531854067f.git
git 3139844 4004643 0 18:37 ? S 0:18 /var/opt/gitlab/gitaly/run/gitaly-4004643/git-exec-1607399968.d/git --git-dir /var/opt/gitlab/git-data/repositories/@hashed/62/7f/627fe5175ddbb7378afaa1dfcdb18083762131cbe3d28e85d9a92b531854067f.git -c gc.auto=0 -c core.autocrlf=input -c core.useReplaceRefs=false -c core.fsync=objects,derived-metadata,reference -c core.fsyncMethod=fsync -c pack.windowMemory=100m -c pack.writeReverseIndex=true -c pack.threads=5 pack-objects --stdout --revs --thin --delta-base-offset --end-of-options
git 3143621 4004643 0 18:37 ? S 0:00 /var/opt/gitlab/gitaly/run/gitaly-4004643/git-exec-1607399968.d/git --git-dir /var/opt/gitlab/git-data/repositories/@hashed/ff/1b/ff1bb16b67bd08e8e5b8b27cdcaced894139c993c77b2330fb431494bd1a84c5.git -c gc.auto=0 -c core.autocrlf=input -c core.useReplaceRefs=false -c core.fsync=objects,derived-metadata,reference -c core.fsyncMethod=fsync -c uploadpack.allowFilter=true -c uploadpack.allowAnySHA1InWant=true -c uploadpack.hideRefs=refs/remotes/ -c uploadpack.hideRefs=refs/tmp/ -c uploadpack.hideRefs=refs/keep-around/ -c pack.windowMemory=100m -c pack.writeReverseIndex=true -c pack.threads=5 -c core.hooksPath=/var/opt/gitlab/gitaly/run/gitaly-4004643/hooks-725560078.d -c uploadpack.packObjectsHook=/var/opt/gitlab/gitaly/run/gitaly-4004643/gitaly-hooks upload-pack --stateless-rpc --end-of-options /var/opt/gitlab/git-data/repositories/@hashed/ff/1b/ff1bb16b67bd08e8e5b8b27cdcaced894139c993c77b2330fb431494bd1a84c5.git
git 3143676 4004643 1 18:37 ? S 0:21 /var/opt/gitlab/gitaly/run/gitaly-4004643/git-exec-1607399968.d/git --git-dir /var/opt/gitlab/git-data/repositories/@hashed/ff/1b/ff1bb16b67bd08e8e5b8b27cdcaced894139c993c77b2330fb431494bd1a84c5.git -c gc.auto=0 -c core.autocrlf=input -c core.useReplaceRefs=false -c core.fsync=objects,derived-metadata,reference -c core.fsyncMethod=fsync -c pack.windowMemory=100m -c pack.writeReverseIndex=true -c pack.threads=5 pack-objects --stdout --revs --thin --delta-base-offset --end-of-options
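The observations in the notes above (several repos, two subcommands) can be checked against all of the unconfined processes rather than by eye. The following is a minimal sketch, assuming ps -fww style lines on stdin, that the repo path is the token following --git-dir, that every -c is followed by exactly one key=value token, and that paths contain no spaces. The name summarize_git_cmdlines is hypothetical.

```shell
# Hypothetical helper: group git command lines by subcommand and repo path.
# Input: ps -fww style lines on stdin (a header line is ignored because it
# contains no --git-dir token).
summarize_git_cmdlines() {
    awk '
    {
        repo = ""; subcmd = ""
        for (i = 1; i <= NF; i++) {
            if ($i == "--git-dir") { repo = $(i + 1); i++; continue }
            if (repo == "") continue            # still before --git-dir
            if ($i == "-c") { i++; continue }   # skip the key=value after -c
            subcmd = $i; break                  # first remaining token
        }
        if (repo != "") print subcmd, repo
    }' | sort | uniq -c
}

# Example invocation (not run here), reusing the PID filter from above:
#   ... | xargs -r ps -fww | summarize_git_cmdlines
```

Each output line has the form "count subcommand repo-path", so a single repo or a single subcommand dominating the unconfined set would be immediately visible.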