High RSS usage by Gitaly process

A recent production emergency for a large premium customer on v13.10.1 was caused by this behavior.

They have three Gitaly servers sharing the same underlying NFS mount. Two of the three were consuming ~22 GiB of memory, causing memory pressure on those hosts and reduced performance. Typical memory consumption for their Gitaly processes is ~5 GiB.

A pprof heap profile showed only ~80 MiB of objects, a small fraction of the ~22 GiB resident set size. Unlike Praefect, the Gitaly process we took a thread dump from had only 29 active goroutines.

Status summary

The Go runtime has two ways of releasing memory back to the OS after a temporary spike in heap size:

  • MADV_FREE is the default method for Go versions 1.12 to 1.15.
  • MADV_DONTNEED is the default for Go 1.16 (and also for 1.11 and older).

Both methods use the madvise syscall to notify the kernel that specific pages of the process's anonymous private memory can be reclaimed if needed. The difference is that MADV_FREE tends to defer that reclaim for longer than MADV_DONTNEED.

Gitaly is currently built with the Go 1.15 runtime, so it has the MADV_FREE behavior: the kernel waits roughly as long as possible before reclaiming the marked memory pages from the Go process. Until that reclaim happens, the pages still count toward the process's RSS. This long-deferred reclaim can indirectly hurt performance under some circumstances, for example by keeping that memory from being used for other useful purposes such as the filesystem cache, which Gitaly relies on to reduce disk I/O.
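To illustrate the gap between what the Go heap actually uses and what the runtime has handed back, here is a minimal sketch (not Gitaly code) that creates a temporary heap spike, drops it, and prints the runtime's own accounting. The function names spikeAndRelease and heapSnapshot and the 64 MiB spike size are made up for this illustration. HeapReleased counts memory the runtime has returned via madvise; under MADV_FREE the kernel may still count those pages in RSS until it needs them, so comparing these numbers against RSS from ps or top is one way to spot the behavior described above.

```go
package main

import (
	"fmt"
	"runtime"
	"runtime/debug"
)

// sink keeps the spike allocation reachable until we drop it on purpose.
var sink [][]byte

// heapSnapshot returns the Go runtime's current memory statistics.
func heapSnapshot() runtime.MemStats {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	return m
}

// spikeAndRelease (hypothetical helper) allocates roughly totalMiB MiB,
// records the heap statistics at the peak, then drops the allocation and
// asks the runtime to return idle memory to the OS.
func spikeAndRelease(totalMiB int) (peak, after runtime.MemStats) {
	for i := 0; i < totalMiB; i++ {
		b := make([]byte, 1<<20)
		// Touch every page so the memory is really backed by the kernel.
		for off := 0; off < len(b); off += 4096 {
			b[off] = 1
		}
		sink = append(sink, b)
	}
	peak = heapSnapshot()

	sink = nil           // end of the spike: nothing references the buffers
	debug.FreeOSMemory() // force a GC and return as much memory as possible
	after = heapSnapshot()
	return peak, after
}

// mib converts a byte count to whole MiB for printing.
func mib(b uint64) uint64 { return b >> 20 }

func main() {
	peak, after := spikeAndRelease(64) // 64 MiB is an arbitrary demo size
	fmt.Printf("peak:  HeapInuse=%d MiB HeapIdle=%d MiB HeapReleased=%d MiB\n",
		mib(peak.HeapInuse), mib(peak.HeapIdle), mib(peak.HeapReleased))
	fmt.Printf("after: HeapInuse=%d MiB HeapIdle=%d MiB HeapReleased=%d MiB\n",
		mib(after.HeapInuse), mib(after.HeapIdle), mib(after.HeapReleased))
}
```

Running it once as-is and once with GODEBUG=madvdontneed=1 while watching the process's RSS should, on a Go 1.12 to 1.15 toolchain, show the difference between the two release methods; the exact numbers depend on the kernel and on memory pressure on the host.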

In the near future, new releases of Gitaly will be built with Go 1.16, consequently switching to the MADV_DONTNEED behavior, making that memory available for other purposes more quickly.

See this thread, #3567 (comment 557318054), for a more detailed summary and discussion of our findings, covering why we think the reported symptoms can be explained by this interaction between the kernel's and the Go runtime's memory managers.

Work-arounds

Any of the following work-arounds may help self-hosted customers who are affected by this problem (i.e. Gitaly retaining memory long after the end of a spike in memory-intensive workload):

  1. Set GODEBUG=madvdontneed=1 in the environment of the Gitaly process.
  2. Disable swap on the Gitaly host, which implicitly makes MADV_FREE reclaim memory immediately. (Per the madvise(2) man page, this applies to Linux kernels before 4.12.)
  3. Wait for Gitaly to be built with the Go 1.16 runtime.
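To confirm that the first work-around actually reached a running process, one option is to inspect that process's environment. The following is a rough sketch, assuming a Linux host where /proc/<pid>/environ is readable by the current user; the helper names godebugValue and madvDontNeedEnabled are hypothetical, and treating the last occurrence of a repeated setting as the effective one is a simplifying assumption of this sketch.

```go
package main

import (
	"bytes"
	"fmt"
	"os"
	"strings"
)

// godebugValue (hypothetical helper) extracts the GODEBUG value from a
// NUL-separated environment block such as /proc/<pid>/environ on Linux.
func godebugValue(environ []byte) (string, bool) {
	const prefix = "GODEBUG="
	for _, kv := range bytes.Split(environ, []byte{0}) {
		s := string(kv)
		if strings.HasPrefix(s, prefix) {
			return s[len(prefix):], true
		}
	}
	return "", false
}

// madvDontNeedEnabled (hypothetical helper) reports whether a GODEBUG
// string, a comma-separated list of name=value pairs, sets madvdontneed=1.
// Assumption: if the name is repeated, the last value is used.
func madvDontNeedEnabled(godebug string) bool {
	enabled := false
	for _, field := range strings.Split(godebug, ",") {
		parts := strings.SplitN(strings.TrimSpace(field), "=", 2)
		if len(parts) == 2 && parts[0] == "madvdontneed" {
			enabled = parts[1] == "1"
		}
	}
	return enabled
}

func main() {
	// Default to this process; pass a PID as the first argument to
	// inspect another process, e.g. the Gitaly server.
	pid := "self"
	if len(os.Args) > 1 {
		pid = os.Args[1]
	}
	data, err := os.ReadFile("/proc/" + pid + "/environ")
	if err != nil {
		fmt.Fprintln(os.Stderr, "cannot read environment:", err)
		return
	}
	value, ok := godebugValue(data)
	if !ok {
		fmt.Println("GODEBUG is not set")
		return
	}
	fmt.Printf("GODEBUG=%s madvdontneed=1 enabled: %v\n", value, madvDontNeedEnabled(value))
}
```

Note that the environment file reflects the variables the process was started with, so the check is only meaningful after Gitaly has been restarted with the new setting.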

For more details on work-arounds, see this discussion thread: #3567 (comment 560803100)

Edited by Matt Smiley