# Fix container memory saturation metrics
We have discovered that none of our container memory metrics are suitable for recording memory saturation in our production deployments. cAdvisor gives us three candidate metrics, and each has cases where it differs significantly from what we'd consider the correct value.
- Usage is all memory used by the container. This sounds useful, but it includes items that will be evicted under memory pressure without the container being out of memory (OOM) killed. This is not useful for any of our cases, as it makes us appear closer to saturation than we are.
- Working set size (WSS) takes the total memory usage and subtracts `inactive_file`, which represents 'inactive' file-backed memory. This corrects for the above, but still includes `active_file`, which can also be evicted under memory pressure, so it too overstates usage.
- Resident set size (RSS) only includes anonymous (i.e. not file-backed) memory and swap cache. Unfortunately, it also includes memory that applications have marked as freed but available to be reclaimed immediately by the application without a page fault. In most cases, therefore, this is a good metric, but in some cases it is catastrophic: we have seen it overestimate memory usage by 10x for some workloads.
How can we fix this?
We have three main avenues that we can explore:
- Get more granular data from cAdvisor. If we can extract just `active_anon` and `inactive_anon` from the cgroup's memory stats, we can use those to get the best possible metric. google/cadvisor#3197 is the upstream issue for this. This is our best option.
- Avoid `MADV_FREE` where possible. We have found that Ruby uses this operation in one specific case. We've filed Ruby issue 19122 to discuss this upstream. Even if that proposal is accepted, though, we will have to upgrade Ruby to take a 'fixed' version, and we may still have programs written in other languages that use this operation.
- Finesse our metrics to pick the least-bad of the two current options (WSS and RSS), depending on the service. This is not ideal: we know that WSS in particular is not an accurate measurement, and in the capacity planning case we perform a lot of aggregations (typically taking the maximum) across a service, which makes it hard to debug.
The rest of this issue is a summary of the discussions in the private issues capacity-planning#171 and capacity-planning#407, where @msmiley, @mkaeppler, and I investigated these issues. It's mostly included for reference.
## Background
We want to record the memory usage of our services running on Kubernetes. We use cAdvisor to gather memory metrics from the underlying cgroup, and then add labels to those container-level metrics so that we can aggregate these metrics up to the component and service level. We then use these aggregations for alerting as well as long-term saturation forecasting.
These alerts and predictions should be good proxies for the underlying event we really care about: OOM kills. When a container is OOM killed, it is immediately terminated and cannot continue the work it had in progress, which can have user impact (for instance, an error when loading a page). We cannot measure OOM kills directly as a saturation metric, because a container either is OOM killed or it isn't; what we want to measure is how close we are to that state.
As such, we want to be able to divide current memory usage by the memory limit defined by the container, to see how close we are to that limit.
## Memory cgroup fundamentals
The cgroups documentation on the stat file gives these definitions:
| Field | Description |
|---|---|
| inactive_anon | # of bytes of anonymous and swap cache memory on inactive LRU list. |
| active_anon | # of bytes of anonymous and swap cache memory on active LRU list. |
| inactive_file | # of bytes of file-backed memory on inactive LRU list. |
| active_file | # of bytes of file-backed memory on active LRU list. |
From there, I'm just going to quote Matt:
> What counts as being charged to a memory cgroup?
>
> - anonymous pages allocated by processes in the cgroup
> - file-backed pages mapped, read, or written by processes in the cgroup, except any pages already charged to another cgroup
>
> What pages can be evicted to avoid OOMK?
>
> - file-backed pages, regardless of whether or not they were loaded via mmap, sysread, etc., unless they are explicitly locked into memory (which is rare)
> - anonymous pages only if swap is enabled (at the host and cgroup levels)
So effectively, we expect a memory cgroup's memory usage to behave similarly to a host without cgroups:
- On swapless hosts like ours, anonymous memory cannot be evicted, so accumulating anonymous memory is the most common way to force the kernel to kill processes.
- Free memory tends to be used for filesystem caching, but any file-backed pages can be evicted from the page cache (and flushed to disk if recently written).
- A workload that thrashes the cache with mostly file-backed pages will not typically trigger the kernel's OOM killer, because file-backed pages can always be evicted from memory to alleviate pressure. However, this thrashing hurts performance via excessive filesystem IO.
## Usage problems
This is the simplest to describe: usage simply counts too much, including items that will be evicted before the OOM killer takes action. From the table above, we can approximate it as `inactive_anon` + `active_anon` + `inactive_file` + `active_file`. Usage is not exactly that sum, but it does include all of those values.
> You might think that memory utilization is easily tracked with `container_memory_usage_bytes`, however, this metric also includes cached (think filesystem cache) items that can be evicted under memory pressure.

Unfortunately, the next sentence in that post is incorrect:

> The better metric is `container_memory_working_set_bytes` as this is what the OOM killer is watching for.
We'll explore why this is false in the next section.
## WSS problems
Working set is an ambiguous term, as noted in this comment from Matt. Here, we're using it to mean working set as reported by cAdvisor (container_memory_working_set_bytes). This is calculated by taking usage and subtracting inactive_file, as seen in the cAdvisor source:
```go
inactiveFileKeyName := "total_inactive_file"
if cgroups.IsCgroup2UnifiedMode() {
	inactiveFileKeyName = "inactive_file"
}
workingSet := ret.Memory.Usage
if v, ok := s.MemoryStats.Stats[inactiveFileKeyName]; ok {
	if workingSet < v {
		workingSet = 0
	} else {
		workingSet -= v
	}
}
ret.Memory.WorkingSet = workingSet
```
This still includes `active_file`, however. Matt explains why even active memory is evictable:

> File pages in the "active" list are not evictable... until they get demoted back down to the "inactive" list. When the cgroup is starving for memory and needs to free a page (e.g. to satisfy a process requesting anonymous memory), it can shrink the total number of filesystem cache pages, and then the normal mechanism of demoting pages from the "active" list to the "inactive" list allows those previously unevictable pages to become eviction candidates the next time. There are only a few special cases where file-backed pages tend to not be evictable, which is why when we see an OOM kill event, the kernel's verbose logs for that kill typically show that most of the memory was anonymous, not file-backed.
>
> [...]
>
> I lean towards treating just the anonymous memory by itself as a saturation metric, since on swapless hosts it is guaranteed to be unevictable.
In our case, we see that WSS is significantly higher than RSS for our Rails containers, and can grow over time:
## RSS problems
RSS should be what we want, as it should match anonymous memory usage, excluding file-backed memory. Unfortunately, when we tried the calculation on our production metrics, our logging service appeared to be much closer to saturation than expected. Looking closer, we can see that this is on our fluentd-elasticsearch pods. Looking at the ratio of WSS to RSS for an example pod, we see the opposite effect to Rails: WSS is between 10% and 20% of RSS, which means that RSS is now overestimated:
Matt investigated again and found:
> The fluentd (ruby) process uses `MADV_FREE` to mark its anonymous memory as lazily reclaimable by the kernel. The cgroup-level accounting (in `memory.stat`) of those `LazyFree` pages is unintuitive:
>
> - `memory.stat.rss` includes them, because they are still private anonymous pages mapped to a process.
> - `memory.stat.*_anon` excludes them, because they have been explicitly freed by that process back to the kernel.
> - `memory.stat.inactive_file` includes them, because they have to be accounted in one of these four buckets (active/inactive anon/file), and this one may make the most sense -- they are evictable from the page cache, just like inactive file-backed pages.
>
> My take-away from this is that for a memory saturation metric, we might be better off explicitly summing `active_anon` + `inactive_anon`, rather than assuming `rss` always does that for us.
`MADV_FREE` is an operation for the `madvise` system call. Its documentation says:

> The kernel can thus free these pages, but the freeing could be delayed until memory pressure occurs. For each of the pages that has been marked to be freed but has not yet been freed, the free operation will be canceled if the caller writes into the page.
Matt also pointed to a Gitaly RSS investigation that was related to a similar issue where Go used MADV_FREE, before reverting that change due to exactly this sort of issue:
> This generally leads to poor user experience, like confusing stats in `top` and other monitoring tools; and bad integration with management systems that respond to memory usage.
We showed that in Ruby's case, this is because of fibers (which are lighter-weight than threads). When a Ruby fiber's stack is freed, the memory holding that stack gets marked as `MADV_FREE`. The Async series of libraries uses fibers heavily. Fluentd uses this implicitly in its `monitor_agent` plugin, as it uses `Async::HTTP` for handling HTTP requests. We use `monitor_agent` in our config.
Incidentally, here we found that Ruby accidentally broke this use of MADV_FREE in 3.1, and it apparently went unnoticed.

