Usage of Cgroups results in memory leak on kernel side
Problem
It appears that running an application causes an ever-growing memory leak on the kernel side.
It is not yet clear which part of the application causes this, but the following symptoms were observed:
- The Percpu metric in /proc/meminfo grows significantly over time. On long-running systems it reaches as much as 10-40 GiB.
- The leak is larger on systems with a high number of CPUs, since Percpu is proportional to the number of CPUs.
- /proc/cgroups shows a growing number of pids and memory hierarchies.
- /sys/fs/cgroup(2) does not show any abnormal number of hierarchies.
- The amount of usable memory accessible to user space is constantly decreasing.
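The symptoms above can be tracked over time with a small sampler. The following is a minimal sketch, assuming a POSIX shell and awk; the function name leak_sample and its optional file arguments are hypothetical and exist only so that saved copies of the two files can be analysed as well.

```shell
# leak_sample [MEMINFO_FILE] [CGROUPS_FILE]
# Prints the three values named in the symptom list on one line.
# Both arguments are optional and default to the live /proc files.
# (Hypothetical helper, not part of any existing tooling.)
leak_sample() {
    meminfo=${1:-/proc/meminfo}
    cgroups=${2:-/proc/cgroups}

    # Percpu is reported in kB in the second column of /proc/meminfo.
    percpu_kb=$(awk '$1 == "Percpu:" { print $2 }' "$meminfo")

    # num_cgroups is the third column of /proc/cgroups.
    memory_cg=$(awk '$1 == "memory" { print $3 }' "$cgroups")
    pids_cg=$(awk '$1 == "pids" { print $3 }' "$cgroups")

    echo "percpu_kb=$percpu_kb memory_cgroups=$memory_cg pids_cgroups=$pids_cg"
}
```

Calling leak_sample from a loop with sleep, and prefixing each line with a timestamp from date, gives a simple time series for nodes that have no graphs.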
Data points
Prometheus graphs
I don't currently see graphs for Percpu for nodes running Rails.
Below are graphs showing elevated Percpu usage in the gitlab.com production environment:
/proc/meminfo
$ cat /proc/meminfo
MemTotal: 97101260 kB
MemFree: 17885700 kB
MemAvailable: 17913332 kB
Buffers: 12472 kB
Cached: 446596 kB
SwapCached: 0 kB
Active: 319972 kB
Inactive: 32139096 kB
Active(anon): 123708 kB
Inactive(anon): 31987920 kB
Active(file): 196264 kB
Inactive(file): 151176 kB
Unevictable: 16816 kB
Mlocked: 16816 kB
SwapTotal: 0 kB
SwapFree: 0 kB
Dirty: 1032 kB
Writeback: 0 kB
AnonPages: 32017036 kB
Mapped: 378832 kB
Shmem: 124156 kB
KReclaimable: 753260 kB
Slab: 7260048 kB
SReclaimable: 753260 kB
SUnreclaim: 6506788 kB
KernelStack: 67904 kB
PageTables: 200088 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
WritebackTmp: 0 kB
CommitLimit: 48550628 kB
Committed_AS: 29679856 kB
VmallocTotal: 133143592960 kB
VmallocUsed: 1692756 kB
VmallocChunk: 0 kB
Percpu: 37122240 kB
HardwareCorrupted: 0 kB
AnonHugePages: 0 kB
ShmemHugePages: 0 kB
ShmemPmdMapped: 0 kB
FileHugePages: 0 kB
FilePmdMapped: 0 kB
CmaTotal: 32768 kB
CmaFree: 0 kB
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 2048 kB
Hugetlb: 0 kB
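To put the Percpu figure in context, it helps to express it in GiB and as a share of MemTotal. The following is a minimal sketch, assuming awk is available; the function name percpu_share and its optional file argument are hypothetical.

```shell
# percpu_share [MEMINFO_FILE]
# Prints Percpu in GiB and as a percentage of MemTotal.
# MEMINFO_FILE defaults to /proc/meminfo; the argument only exists so a
# saved copy of the output can be analysed. (Hypothetical helper.)
percpu_share() {
    awk '
        $1 == "MemTotal:" { total = $2 }    # value is in kB
        $1 == "Percpu:"   { percpu = $2 }   # value is in kB
        END {
            if (total == 0) exit 1          # MemTotal missing: nothing to report
            printf "Percpu: %.1f GiB (%.1f%% of MemTotal)\n",
                   percpu / 1048576, 100 * percpu / total
        }
    ' "${1:-/proc/meminfo}"
}
```

With the values shown above (MemTotal 97101260 kB, Percpu 37122240 kB), this prints "Percpu: 35.4 GiB (38.2% of MemTotal)".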
/proc/cgroups
$ cat /proc/cgroups
#subsys_name hierarchy num_cgroups enabled
cpuset 9 1 1
cpu 11 143 1
cpuacct 11 143 1
blkio 10 141 1
memory 6 207719 1
devices 12 137 1
freezer 13 2 1
net_cls 8 1 1
perf_event 4 1 1
net_prio 8 1 1
hugetlb 2 1 1
pids 7 207680 1
rdma 5 1 1
misc 3 1 1
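The mismatch between /proc/cgroups and /sys/fs/cgroup described in the symptoms can be checked per controller by comparing num_cgroups with the number of cgroup directories visible to user space. The following is a rough sketch under two assumptions: a cgroup v1 layout where each controller is mounted at /sys/fs/cgroup/<controller>, and that num_cgroups counts the root cgroup as well. The function name cgroup_gap and its optional arguments are hypothetical.

```shell
# cgroup_gap CONTROLLER [CGROUPS_FILE] [MOUNT_DIR]
# Compares the kernel's cgroup count for CONTROLLER with the number of
# directories under its mount point. (Hypothetical helper.)
cgroup_gap() {
    controller=$1
    cgroups_file=${2:-/proc/cgroups}
    # Assumption: cgroup v1, one mount point per controller.
    mount_dir=${3:-/sys/fs/cgroup/$controller}

    # num_cgroups is the third column of /proc/cgroups.
    kernel_count=$(awk -v c="$controller" '$1 == c { print $3 }' "$cgroups_file")
    [ -n "$kernel_count" ] || return 1

    # Each visible cgroup is a directory; the mount point itself is the root
    # cgroup. The trailing slash makes find descend into a symlinked mount.
    visible_count=$(find "$mount_dir/" -type d 2>/dev/null | wc -l)

    echo "$controller kernel=$kernel_count visible=$((visible_count)) gap=$((kernel_count - visible_count))"
}
```

A gap that is large and keeps growing for memory and pids, as in cgroup_gap memory or cgroup_gap pids, would be consistent with cgroups that are no longer visible in the filesystem but are still held by the kernel.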
What was tested?
- We tried running echo 3 > /proc/sys/vm/drop_caches and, as expected, it did not help.
- Restarting the nodes does work around the problem.