How are git processes sometimes not assigned to a gitaly cgroup?
This issue is a spin-off from https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/23532 to discuss and explore one of several useful discoveries: Occasionally the gitaly process will spawn a child process (typically a git process) but not assign it to a cgroup.
Some initial discussion is in this thread: https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/23532#note_1377531701
What happens normally?
The expected control flow is for gitaly to first create its child process and then assign it to a cgroup.
Failing to assign the process to a cgroup should not (and does not) impede the process from running normally. However, it does mean the process's CPU and memory usage will not be constrained in the expected way.
Every process has a cgroup, by default inherited from its parent process.
The gitaly process runs in a chef-managed cgroup (/system.slice/gitlab-runsvdir.service), so initially all of its child processes begin life in that cgroup too.
$ cat /proc/$( pidof gitaly )/cgroup | grep 'cpuacct'
7:cpu,cpuacct:/system.slice/gitlab-runsvdir.service
Shortly after gitaly creates a child process, it reassigns that process to a per-repo cgroup under a gitaly-managed cgroup hierarchy: /gitaly/gitaly-[pid]/repos-[nnn]
$ pgrep -x git | head | xargs -i grep 'cpuacct' /proc/{}/cgroup
7:cpu,cpuacct:/gitaly/gitaly-4117892/repos-23
7:cpu,cpuacct:/gitaly/gitaly-4117892/repos-537
7:cpu,cpuacct:/gitaly/gitaly-4117892/repos-537
7:cpu,cpuacct:/gitaly/gitaly-4117892/repos-796
7:cpu,cpuacct:/gitaly/gitaly-4117892/repos-150
7:cpu,cpuacct:/gitaly/gitaly-4117892/repos-968
7:cpu,cpuacct:/gitaly/gitaly-4117892/repos-537
7:cpu,cpuacct:/gitaly/gitaly-4117892/repos-150
7:cpu,cpuacct:/gitaly/gitaly-4117892/repos-150
7:cpu,cpuacct:/gitaly/gitaly-4117892/repos-51
If that cgroup assignment fails for any reason, the process will continue to run in its original cgroup.
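The two states described above (still in the inherited cgroup versus moved to a per-repo cgroup) can be told apart mechanically. The following is a minimal sketch, assuming the cgroup v1 line format and the path patterns shown in the examples above. The function name classify_cgroup_line is hypothetical and is not part of gitaly.

```shell
#!/bin/sh
# Hypothetical helper (not part of gitaly): classify one line of
# /proc/<pid>/cgroup, which in cgroup v1 has the form "id:controllers:path".
classify_cgroup_line() {
  # Strip the leading "id:controllers:" to leave only the cgroup path.
  cg_path="${1#*:*:}"
  case "$cg_path" in
    # Per-repo cgroup under the gitaly-managed hierarchy.
    /gitaly/gitaly-[0-9]*/repos-[0-9]*) echo "per-repo" ;;
    # Cgroup inherited from the gitaly parent process.
    /system.slice/gitlab-runsvdir.service) echo "default" ;;
    *) echo "other" ;;
  esac
}

# Example usage against a live PID (left commented out so the block has no side effects):
# grep 'cpuacct' "/proc/$PID/cgroup" | while read -r line; do classify_cgroup_line "$line"; done
```

A git process that reports "default" more than a moment after it was spawned is a candidate for the problem discussed in this issue.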
Example of the problem
Caveats for this audit:
- Some false positives come from residual processes that were spawned months ago, before the gitaly cgroups feature was enabled. That handful of ancient processes should not still be running, but we will ignore them here as an unrelated, low-priority issue. To filter them out, we only look at processes whose elapsed time is less than 7 days.
- To avoid false matches on newly spawned processes that gitaly has not yet had a chance to reassign, we also exclude processes that are 1 second old or younger.
This example shows a production gitaly node with 2 long-running git processes running outside of the expected cgroup hierarchy.
This audit command searches for git processes that are still assigned to gitaly's default cgroup (/system.slice/gitlab-runsvdir.service) rather than a per-repo cgroup. It also excludes very young or very old processes (etimes = elapsed age in seconds), as noted in the caveats above.
The output shows 2 git processes (a git pack-objects and a git upload-pack), both associated with the same correlation_id and both over an hour old. Neither has used much CPU or memory, but both are running in the wrong cgroup.
msmiley@file-04-stor-gprd.c.gitlab-production.internal:~$ for PID in $( pgrep -x git ) ; do grep 'cpuacct:/system.slice/gitlab-runsvdir.service' /proc/$PID/cgroup >& /dev/null || continue ; [[ $( ps -p $PID -o etimes | awk '$1 > 1 && $1 < 7*24*60*60' | wc -l ) -gt 0 ]] || continue ; ps -p $PID -ww -o pid,ppid,etimes,etime,cputime,rss,lstart,args ; echo ; sudo cat /proc/$PID/environ | tr '\0' '\n' | grep 'CORRELATION_ID' ; echo ; cat /proc/$PID/cgroup | egrep 'cpuacct|memory' ; echo ; done
PID PPID ELAPSED ELAPSED TIME RSS STARTED COMMAND
3028825 151814 5054 01:24:14 00:00:00 27428 Thu May 4 22:40:57 2023 /var/opt/gitlab/gitaly/run/gitaly-151814/git-exec-3900116090.d/git --git-dir /var/opt/gitlab/git-data/repositories/@hashed/fa/74/fa740ffa48f6f7ef3e2960bd9e4b086181565aa00d85fe7d499ed3035471b6aa.git -c gc.auto=0 -c core.autocrlf=input -c core.useReplaceRefs=false -c core.fsync=objects,derived-metadata,reference -c core.fsyncMethod=fsync -c uploadpack.allowFilter=true -c uploadpack.allowAnySHA1InWant=true -c uploadpack.hideRefs=refs/keep-around/ -c uploadpack.hideRefs=refs/remotes/ -c uploadpack.hideRefs=refs/tmp/ -c pack.windowMemory=100m -c pack.writeReverseIndex=true -c pack.threads=5 -c core.hooksPath=/var/opt/gitlab/gitaly/run/gitaly-151814/hooks-3588709574.d -c uploadpack.packObjectsHook=/var/opt/gitlab/gitaly/run/gitaly-151814/gitaly-hooks upload-pack --stateless-rpc --end-of-options /var/opt/gitlab/git-data/repositories/@hashed/fa/74/fa740ffa48f6f7ef3e2960bd9e4b086181565aa00d85fe7d499ed3035471b6aa.git
CORRELATION_ID=01GZMESZMBTR8ZHVR0WGE41B5M
11:memory:/system.slice/gitlab-runsvdir.service
3:cpu,cpuacct:/system.slice/gitlab-runsvdir.service
PID PPID ELAPSED ELAPSED TIME RSS STARTED COMMAND
3028843 151814 5054 01:24:14 00:00:06 1310132 Thu May 4 22:40:57 2023 /var/opt/gitlab/gitaly/run/gitaly-151814/git-exec-3900116090.d/git --git-dir /var/opt/gitlab/git-data/repositories/@hashed/fa/74/fa740ffa48f6f7ef3e2960bd9e4b086181565aa00d85fe7d499ed3035471b6aa.git -c gc.auto=0 -c core.autocrlf=input -c core.useReplaceRefs=false -c core.fsync=objects,derived-metadata,reference -c core.fsyncMethod=fsync -c pack.windowMemory=100m -c pack.writeReverseIndex=true -c pack.threads=5 pack-objects --stdout --revs --thin --delta-base-offset --end-of-options
CORRELATION_ID=01GZMESZMBTR8ZHVR0WGE41B5M
11:memory:/system.slice/gitlab-runsvdir.service
3:cpu,cpuacct:/system.slice/gitlab-runsvdir.service
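For readability, the one-line audit above can be unrolled into functions. This is a sketch of the same logic, not a drop-in replacement. The names in_audit_window and audit_git_cgroups are hypothetical, and it assumes a procps ps that accepts "-o etimes=" to suppress the header.

```shell
#!/bin/sh
# Cgroup that git processes inherit from gitaly before reassignment.
DEFAULT_CGROUP='/system.slice/gitlab-runsvdir.service'
# Upper age bound from the caveats: 7 days, in seconds.
MAX_AGE=$((7 * 24 * 60 * 60))

# Succeeds if the elapsed age (seconds) is older than 1 second and younger than 7 days.
in_audit_window() {
  [ "$1" -gt 1 ] && [ "$1" -lt "$MAX_AGE" ]
}

# Hypothetical wrapper: report git processes still in gitaly's default cgroup.
audit_git_cgroups() {
  for pid in $(pgrep -x git); do
    # Skip processes that have already been moved to another cgroup.
    grep -q "cpuacct:${DEFAULT_CGROUP}" "/proc/$pid/cgroup" 2>/dev/null || continue
    # "etimes=" prints the elapsed seconds with no header line.
    age=$(ps -p "$pid" -o etimes= | tr -d ' ')
    [ -n "$age" ] && in_audit_window "$age" || continue
    ps -p "$pid" -ww -o pid,ppid,etimes,etime,cputime,rss,lstart,args
    echo
    # Reading another user's environ requires root.
    sudo cat "/proc/$pid/environ" | tr '\0' '\n' | grep 'CORRELATION_ID'
    echo
    grep -E 'cpuacct|memory' "/proc/$pid/cgroup"
    echo
  done
}

# Run the audit by calling: audit_git_cgroups
```

Keeping the age check in its own function makes the two caveat thresholds easy to adjust or test in isolation.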
Survey production nodes
How widespread is this problem?
We can survey the number of git processes in the wrong cgroup as of the current moment in time. Ideally the count should be zero on all nodes (since we exclude very young processes). But several nodes have a non-zero count.
Again, this only practically matters when a node is under pressure for CPU or memory. But the existence of this problem erodes the efficacy of cgroup limits during saturation events, so we want to proactively address it.
Note that the final ps in the survey command prints a header line, which wc -l also counts. A non-zero count is therefore one higher than the number of matching processes: for example, a count of 3 corresponds to 2 git processes.
$ mussh -m10 -h file-{{01..99},{100..102}}-stor-gprd.c.gitlab-production.internal -c 'pgrep -x "git" | xargs -r ps -o pid,etimes,cgroup:1000 | grep -v "PID" | awk "\$2 > 1 && \$2 < 7*24*60*60" | egrep -v "/gitaly/gitaly-[0-9]*/repos-[0-9]*" | awk "{ print \$1 }" | xargs -r ps -ww -o pid,ppid,lstart,etime,cputime,s,comm,cgroup:10000,args | tee ps.git_processes_outside_gitaly_cgroups.$( date +%Y%m%d_%H%M%S ).out | wc -l'
file-02-stor-gprd.c.gitlab-production.internal: 0
file-01-stor-gprd.c.gitlab-production.internal: 0
file-03-stor-gprd.c.gitlab-production.internal: 0
file-04-stor-gprd.c.gitlab-production.internal: 5
file-05-stor-gprd.c.gitlab-production.internal: 0
file-06-stor-gprd.c.gitlab-production.internal: 0
file-07-stor-gprd.c.gitlab-production.internal: 0
file-08-stor-gprd.c.gitlab-production.internal: 0
file-09-stor-gprd.c.gitlab-production.internal: 3
file-100-stor-gprd.c.gitlab-production.internal: 0
file-101-stor-gprd.c.gitlab-production.internal: 0
file-102-stor-gprd.c.gitlab-production.internal: 0
file-11-stor-gprd.c.gitlab-production.internal: 0
file-10-stor-gprd.c.gitlab-production.internal: 0
file-13-stor-gprd.c.gitlab-production.internal: 0
file-12-stor-gprd.c.gitlab-production.internal: 3
file-14-stor-gprd.c.gitlab-production.internal: 0
file-15-stor-gprd.c.gitlab-production.internal: 3
file-16-stor-gprd.c.gitlab-production.internal: 0
file-17-stor-gprd.c.gitlab-production.internal: 0
file-19-stor-gprd.c.gitlab-production.internal: 0
file-18-stor-gprd.c.gitlab-production.internal: 7
file-20-stor-gprd.c.gitlab-production.internal: 3
file-21-stor-gprd.c.gitlab-production.internal: 0
file-22-stor-gprd.c.gitlab-production.internal: 0
file-23-stor-gprd.c.gitlab-production.internal: 0
file-24-stor-gprd.c.gitlab-production.internal: 0
file-25-stor-gprd.c.gitlab-production.internal: 3
file-27-stor-gprd.c.gitlab-production.internal: 4
file-29-stor-gprd.c.gitlab-production.internal: 15
file-26-stor-gprd.c.gitlab-production.internal: 3
file-28-stor-gprd.c.gitlab-production.internal: 0
file-30-stor-gprd.c.gitlab-production.internal: 0
file-31-stor-gprd.c.gitlab-production.internal: 0
file-32-stor-gprd.c.gitlab-production.internal: 0
file-33-stor-gprd.c.gitlab-production.internal: 0
file-34-stor-gprd.c.gitlab-production.internal: 0
file-35-stor-gprd.c.gitlab-production.internal: 3
file-42-stor-gprd.c.gitlab-production.internal: 0
file-36-stor-gprd.c.gitlab-production.internal: 0
file-38-stor-gprd.c.gitlab-production.internal: 0
file-37-stor-gprd.c.gitlab-production.internal: 3
file-40-stor-gprd.c.gitlab-production.internal: 0
file-41-stor-gprd.c.gitlab-production.internal: 0
file-39-stor-gprd.c.gitlab-production.internal: 0
file-43-stor-gprd.c.gitlab-production.internal: 0
file-44-stor-gprd.c.gitlab-production.internal: 3
file-45-stor-gprd.c.gitlab-production.internal: 3
file-46-stor-gprd.c.gitlab-production.internal: 0
file-48-stor-gprd.c.gitlab-production.internal: 0
file-47-stor-gprd.c.gitlab-production.internal: 0
file-49-stor-gprd.c.gitlab-production.internal: 3
file-53-stor-gprd.c.gitlab-production.internal: 0
file-52-stor-gprd.c.gitlab-production.internal: 0
file-50-stor-gprd.c.gitlab-production.internal: 0
file-51-stor-gprd.c.gitlab-production.internal: 7
file-54-stor-gprd.c.gitlab-production.internal: 0
file-55-stor-gprd.c.gitlab-production.internal: 4
file-56-stor-gprd.c.gitlab-production.internal: 13
file-58-stor-gprd.c.gitlab-production.internal: 3
file-57-stor-gprd.c.gitlab-production.internal: 0
file-59-stor-gprd.c.gitlab-production.internal: 0
file-61-stor-gprd.c.gitlab-production.internal: 0
file-60-stor-gprd.c.gitlab-production.internal: 3
file-62-stor-gprd.c.gitlab-production.internal: 0
file-63-stor-gprd.c.gitlab-production.internal: 3
file-64-stor-gprd.c.gitlab-production.internal: 3
file-65-stor-gprd.c.gitlab-production.internal: 0
file-66-stor-gprd.c.gitlab-production.internal: 0
file-67-stor-gprd.c.gitlab-production.internal: 0
file-68-stor-gprd.c.gitlab-production.internal: 0
file-70-stor-gprd.c.gitlab-production.internal: 0
file-69-stor-gprd.c.gitlab-production.internal: 0
file-71-stor-gprd.c.gitlab-production.internal: 0
file-72-stor-gprd.c.gitlab-production.internal: 0
file-73-stor-gprd.c.gitlab-production.internal: 0
file-74-stor-gprd.c.gitlab-production.internal: 0
file-75-stor-gprd.c.gitlab-production.internal: 3
file-76-stor-gprd.c.gitlab-production.internal: 0
file-77-stor-gprd.c.gitlab-production.internal: 0
file-78-stor-gprd.c.gitlab-production.internal: 3
file-79-stor-gprd.c.gitlab-production.internal: 0
file-80-stor-gprd.c.gitlab-production.internal: 0
file-81-stor-gprd.c.gitlab-production.internal: 0
file-82-stor-gprd.c.gitlab-production.internal: 0
file-83-stor-gprd.c.gitlab-production.internal: 3
file-84-stor-gprd.c.gitlab-production.internal: 0
file-85-stor-gprd.c.gitlab-production.internal: 0
file-86-stor-gprd.c.gitlab-production.internal: 0
file-87-stor-gprd.c.gitlab-production.internal: 0
file-88-stor-gprd.c.gitlab-production.internal: 0
file-90-stor-gprd.c.gitlab-production.internal: 0
file-91-stor-gprd.c.gitlab-production.internal: 0
file-89-stor-gprd.c.gitlab-production.internal: 110
file-92-stor-gprd.c.gitlab-production.internal: 3
file-94-stor-gprd.c.gitlab-production.internal: 0
file-93-stor-gprd.c.gitlab-production.internal: 0
file-95-stor-gprd.c.gitlab-production.internal: 0
file-96-stor-gprd.c.gitlab-production.internal: 0
file-97-stor-gprd.c.gitlab-production.internal: 0
file-98-stor-gprd.c.gitlab-production.internal: 0
file-99-stor-gprd.c.gitlab-production.internal: 0
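To make the per-node counts easier to digest, the survey output can be piped through a small summarizer. This is a sketch that assumes the "hostname: count" line format shown above. The function name summarize_survey is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read "hostname: count" lines on stdin, print the
# nodes with a non-zero count, then a one-line summary.
summarize_survey() {
  awk -F': ' '
    { total++ }
    $2 > 0 { affected++; print }
    END { printf "%d of %d nodes have a non-zero count\n", affected, total }
  '
}

# Example usage, assuming the survey output was saved to a file named survey.out:
# summarize_survey < survey.out
```

Sorting the non-zero lines by count (for example with sort -t: -k2 -n) would surface outliers such as file-89 first.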