Reference: Analyzing Linux kernel memory management anomalies
The main body of notes lives here in the issue description. Supplemental notes will be added in the comments, but I will keep updating the description to make it easier to consume as a tour of some related bits of kernel memory management.
Intro
Memory management is a huge topic, and we will not attempt to cover it all. These notes focus on 2 types of memory management in the Linux kernel: slab and page allocations.
This issue aims to share some useful discoveries and observability tools/techniques that came out of researching the root cause of a recurring subtle memory starvation pathology on Kubernetes nodes in incident https://gitlab.com/gitlab-com/gl-infra/gitlab-dedicated/incident-management/-/issues/27. I suspect the same mechanism was likely at the root of some other past incidents where memory pressure induced latency spikes and container failures.
Beyond the incident itself, these findings have more general implications for Kubernetes nodes. We explore how a workload can potentially drive a large amount of memory usage to accrue outside of cgroups -- and consequently outside of the containers' prescribed limits. This usage at the node level indirectly pressures well-behaved containers, leading to stalls, memory reclaim, and ultimately OOM kills.
Along the way, we will highlight some unintuitive observability gotchas and some surprisingly useful tools that folks may find helpful in more general explorations of memory usage.
Primer
The following sections briefly review some background on:
- How do userspace processes and userspace memory managers interact with the kernel's management of physical memory?
- What do the 2 main kernel memory allocators do, and how do they differ in purpose?
Layered memory managers
We are going to mostly gloss past userspace memory managers and focus on the underlying kernel memory management. But for context, here is a brief overview of some of the ways userspace processes can request memory from the kernel.
For most workloads, most of the physical memory is allocated by the page allocator, much of it for the page cache. Broadly speaking, memory can be either "file-backed" (a copy of part of a file on disk) or "anonymous" (not file-backed). When a process reads or writes a file or loads a library function, that can implicitly cause the kernel to allocate memory on behalf of the process, to read the requested portion of a file from disk into memory. In contrast, when a process needs to grow its heap, create stack space for a new thread, or allocate some off-heap memory, the kernel will allocate anonymous memory. Both of these use-cases (file-backed and anonymous memory) are handled by the same mechanism: the kernel's page allocator. An important difference is that file-backed memory can easily be freed by evicting the page from cache (after flushing any writes to disk), but anonymous memory generally cannot be evicted unless a swap partition is enabled. Because of these distinct eviction properties, the kernel maintains separate lists of pages that are file-backed versus anonymous. But the page allocation mechanism and interface is the same.
Many userspace runtimes (e.g. ruby, go) and libraries (e.g. jemalloc) have their own memory management models that provide features and optimization goals specific to their operating context. Those userspace memory managers are layered on top of the kernel's memory manager. They request pages of memory from the kernel (e.g. by calling libc's `malloc` and friends or by `mmap` syscalls) and then re-allocate portions of that memory to application threads (e.g. as heap or off-heap memory). In these notes, we will focus on the kernel layer, but for context it can be helpful to remember that these varied userspace memory managers are built on top of the kernel's page allocator.
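As a concrete illustration of that layering, here is a minimal Python sketch (not from any incident tooling) that asks the kernel directly for anonymous pages via `mmap`, the same interface that userspace allocators build on:

```python
import mmap

# Request 16 pages of anonymous memory straight from the kernel, bypassing
# the userspace heap manager. Passing -1 as the file descriptor creates an
# anonymous (not file-backed) mapping, like MAP_ANONYMOUS in C.
length = 16 * mmap.PAGESIZE
buf = mmap.mmap(-1, length)

buf[0] = 0x41              # touching a page is what actually faults it in
assert len(buf) == length  # the mapping behaves like a mutable byte buffer

buf.close()                # munmap(2): return the pages to the kernel
```

A userspace allocator like jemalloc does essentially this in bulk, then carves the returned region into smaller application-visible allocations.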
Kernel manages memory allocations in units called pages.
When a userspace process gets memory from the kernel, the allocated size will be one or more whole pages of memory:
- A normal page is 4 KB (`PAGE_SIZE` = 4096 bytes). When documentation talks about "a page", this is what it means. Larger allocations are a multiple of `PAGE_SIZE`.
- A huge page is typically 2 MB on Linux. It exists as an optimization for systems with a large amount of memory, to improve the cache hit rate for address translations via the CPU's translation look-aside buffer (TLB). Linux implements 2 interfaces for using huge pages: HugeTLB and transparent huge pages (THP). For context, accessing main memory is relatively slow, requiring several CPU cycles, and each address translation may require multiple round-trips. Increasing the TLB cache hit rate reduces how often threads stall waiting for address translation, which improves the CPU's instruction throughput.
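The page-size arithmetic above can be sketched as follows; the helper name `pages_needed` is illustrative, not a kernel API:

```python
def pages_needed(nbytes: int, page_size: int) -> int:
    """Round a request up to whole pages, as the page allocator does."""
    return -(-nbytes // page_size)  # ceiling division

PAGE_SIZE = 4096                    # the common PAGE_SIZE on Linux
HUGE_PAGE_SIZE = 2 * 1024 * 1024    # typical 2 MB huge page on x86-64

# A 10,000-byte request consumes 3 whole normal pages (12,288 bytes).
assert pages_needed(10_000, PAGE_SIZE) == 3
# One 2 MB huge page spans as many bytes as 512 normal 4 KB pages.
assert HUGE_PAGE_SIZE == 512 * PAGE_SIZE
```

(The runtime page size is also available from Python as `mmap.PAGESIZE`.)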
Kernel provides 2 main mechanisms for allocating physical memory:
- The page allocator allocates one or more whole pages to the caller, rounding up the request size if needed. This allocator supports both userspace and kernel memory requests, and typically most memory is allocated through it. Internally it maintains lists of free pages of memory, which it organizes into size-bracketed runs of contiguous free pages to resist external fragmentation of the free space.
- The slab allocator handles requests within the kernel for amounts of memory smaller than a page. It efficiently allocates these requests from reusable pools of small fixed-size objects. Since the objects are fixed size, a certain number of objects fit per page. When one of those slab caches needs to grow to store more objects, it requests more pages from the page allocator. Userspace processes do not directly allocate memory from slab caches, but they can indirectly cause the kernel to do so. For example, when a process opens a file, the kernel allocates inode and dentry objects from their respective slab caches to represent the file and its directory hierarchy.
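To make the fixed-size-pool idea concrete, here is a toy Python model of a slab cache; all class and method names are invented for illustration and do not mirror kernel internals:

```python
class ToySlabCache:
    """Toy slab cache: fixed-size objects carved out of whole pages
    obtained from a (simulated) page allocator."""

    PAGE_SIZE = 4096

    def __init__(self, object_size: int):
        self.object_size = object_size
        self.objects_per_page = self.PAGE_SIZE // object_size
        self.free_objects = []   # freelist of (page_id, slot) pairs
        self.pages = 0           # pages requested from the page allocator

    def _grow(self):
        # Out of free objects: take one more page from the page allocator
        # and carve it into fixed-size slots.
        page_id = self.pages
        self.pages += 1
        for slot in range(self.objects_per_page):
            self.free_objects.append((page_id, slot))

    def alloc(self):
        if not self.free_objects:
            self._grow()
        return self.free_objects.pop()

    def free(self, obj):
        self.free_objects.append(obj)

cache = ToySlabCache(object_size=128)    # e.g. a pool of 128-byte objects
objs = [cache.alloc() for _ in range(40)]
assert cache.pages == 2                  # 32 objects fit per 4 KB page
```

Freed objects return to the pool for reuse, which is why slab memory can remain claimed by the kernel even when few objects are live.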
Typically most memory is allocated via the page allocator, so let's focus there first.
Page allocator
As its name suggests, the page allocator allocates one or more whole pages of physical memory to the caller.
When a process requests memory via `malloc` or `mmap`, this usually does not immediately map physical memory onto the virtual address range returned to that process. Unless immediate allocation is specifically requested, the kernel waits to allocate physical pages until the first time each virtual page is accessed. This "just in time" allocation of physical memory is one of the ways the kernel supports over-committing memory. Typically physical pages get allocated during page faults, which behave as follows. Each process has a page table that maps its virtual addresses to physical pages of memory. While the process is running on a CPU, the processor uses that page table's entries (PTEs) to translate virtual to physical addresses. When a virtual address is accessed but its PTE does not yet map a physical page, the CPU raises a page fault, which traps into the kernel to handle it. The kernel's page allocator assigns a physical page, updates the process's page table entry for future reference, and returns the newly assigned page's physical address, letting the CPU resume normal execution of the process -- all without the process's awareness.
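The fault-handling flow above can be modeled with a toy sketch; `ToyMMU` and its fields are purely illustrative:

```python
# Toy model of demand paging: the "page table" starts empty, and physical
# pages are only assigned on first access (the simulated page fault).

class ToyMMU:
    def __init__(self):
        self.page_table = {}      # virtual page number -> physical page number
        self.next_phys_page = 0
        self.faults = 0

    def translate(self, vpn: int) -> int:
        if vpn not in self.page_table:
            # Page fault: allocate a physical page just in time and record
            # the new PTE so later accesses translate without faulting.
            self.faults += 1
            self.page_table[vpn] = self.next_phys_page
            self.next_phys_page += 1
        return self.page_table[vpn]

mmu = ToyMMU()
mmu.translate(7)    # first touch of virtual page 7 faults
mmu.translate(7)    # second access hits the existing PTE
mmu.translate(8)    # a different page faults again
assert mmu.faults == 2
```

This is also why a large `mmap` region costs almost nothing until it is touched: the "allocation" is initially just empty PTEs.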
Free pages are tracked by freelists that are grouped by the number of contiguous free pages. When a page is freed, if it is adjacent to another free page, they can become "buddies" on a freelist. Groups of adjacent buddies are tracked in powers-of-two size brackets: a single page has order 0 (2^0), two adjacent pages have order 1 (2^1), four pages have order 2 (2^2), etc. So a request for 8 pages of contiguous memory can be satisfied by a freelist entry of at least "order 3" (2^3 = 8 pages). This is the "buddy" system, and its purpose is to resist fragmenting free memory. It allocates small requests from small runs of contiguous free pages, while preserving the larger runs of contiguous free pages for larger requests. `/proc/buddyinfo` summarizes the number of contiguous runs of free pages for each `2^[order]` size, grouped by NUMA node and zone. If free memory becomes too fragmented, then large allocation requests can fail. To reduce that risk, when fragmentation exceeds a tunable threshold (`vm.extfrag_threshold`), the kernel attempts to migrate pages to consolidate free memory into larger runs.
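As a sketch of how to read `/proc/buddyinfo`, the snippet below parses an illustrative (made-up) row and totals the free memory it describes, assuming 4 KB pages:

```python
# Each /proc/buddyinfo row lists, per NUMA node and zone, the count of free
# runs at each order (columns are order 0, 1, 2, ...). This sample line is
# invented for illustration, not captured from a real system.
sample = "Node 0, zone   Normal    4  3  2  1  0  0  0  1  0  0  0"

fields = sample.split()
zone = fields[3]
counts = [int(n) for n in fields[4:]]   # counts[order] = free runs of 2^order pages

PAGE_SIZE = 4096
free_bytes = sum(count * (2 ** order) * PAGE_SIZE
                 for order, count in enumerate(counts))

# The order-7 column shows one run of 2^7 = 128 contiguous free pages, so an
# order-7 (or smaller) request could still succeed despite fragmentation.
assert counts[7] == 1
```

A row dominated by the low-order columns is the signature of fragmented free memory: plenty of free pages, but few large contiguous runs.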
The page allocator internally consists of the buddy allocator (described above) and the per-CPU page allocator (PCP). Allocating pages from the buddy allocator's global freelists requires a global lock, which could become contended under heavy load from multiple processors. To avoid that risk of excessive contention, the page allocator also maintains per-CPU pools of free pages. By pre-assigning some free pages to be allocated exclusively by each CPU, many page allocation requests can be satisfied without taking the global lock: whichever CPU is running the process requesting memory can allocate pages from its local PCP freelist. These per-CPU pools are drained or refilled as needed from the global freelists. If a page allocation request cannot be satisfied by the PCP allocator (e.g. because the local CPU was not already assigned enough pages of the requested order and zone), it automatically falls through to the buddy allocator. The caller does not know or care whether the page allocator satisfied its request via the per-CPU or global freelists; that is an internal implementation detail, not part of the calling API.
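Here is a toy model of that fast path, with invented names; the counter of global-freelist trips stands in for the global-lock cost:

```python
# Toy sketch of the per-CPU page (PCP) fast path: each CPU keeps a small
# local pool of free pages, and only an empty pool falls through to the
# (lock-protected) global buddy freelists.

class ToyPageAllocator:
    def __init__(self, ncpus: int, batch: int = 4):
        self.global_free = list(range(100))      # global freelist of page ids
        self.pcp = {cpu: [] for cpu in range(ncpus)}
        self.batch = batch
        self.global_allocs = 0                   # times the "global lock" was taken

    def alloc_page(self, cpu: int) -> int:
        pool = self.pcp[cpu]
        if not pool:
            # Refill the local pool from the global freelists in one batch,
            # paying the global-lock cost once for several future allocations.
            self.global_allocs += 1
            for _ in range(self.batch):
                pool.append(self.global_free.pop())
        return pool.pop()

alloc = ToyPageAllocator(ncpus=2)
for _ in range(8):
    alloc.alloc_page(cpu=0)
# Eight allocations on CPU 0 needed only two trips to the global freelists.
assert alloc.global_allocs == 2
```

The real PCP lists are likewise refilled and drained in batches, amortizing the lock traffic rather than eliminating it.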
Page allocation requests for more than one page (order > 0) return a compound page. A compound page is a contiguous series of normal pages, where the first page is the "head" and all subsequent pages are "tails". For example, "HTTT" = a 16 KB compound page, consisting of 1 head and 3 tails. Hugepages and transparent huge pages are examples of compound pages (where 1 hugepage = 512 contiguous normal pages = 2 MB). Slab caches where each slab is more than one page also use compound pages.
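The size-to-order arithmetic can be sketched as follows (`order_for` is an illustrative helper, not kernel code):

```python
def order_for(nbytes: int, page_size: int = 4096) -> int:
    """Smallest buddy order whose 2**order pages cover nbytes."""
    pages = -(-nbytes // page_size)        # round up to whole pages
    return max(pages - 1, 0).bit_length()  # smallest order with 2^order >= pages

assert order_for(4096) == 0               # a single normal page, order 0
assert order_for(16 * 1024) == 2          # "HTTT": 4 pages, order 2
assert order_for(2 * 1024 * 1024) == 9    # 2 MB huge page: 512 pages, order 9
```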
Slab allocator
RESUME WRITING/EDITING HERE -- Notes below this line are unpolished scratch notes.
Key concepts:
- Slab allocator's `kmalloc` is "the normal method of allocating memory for objects smaller than page size in the kernel."
- Large allocations can bypass slab, but this is typically only done when the allocation is over 8 pages (such that minimal waste comes from partial page use).
- Slabs can be multiple pages to handle object alignment. See "pagesperslab" in /proc/slabinfo, typically 1, 2, 4, or 8 pages per slab.
- SLUB is the default slab allocator and the most versatile by far. SLAB and SLOB seem to be rarely used except on niche platforms.
- SLUB allocator merges similar slab pools (unless `slub_debug` options are enabled).
  - If five different subsystems all want to allocate (different) 128-byte objects with no special properties, they don't each get separate slab types with separate slabinfo entries; instead they are all merged into one slab type and thus one slabinfo entry.
  - That slabinfo entry normally shows the name of one of them, probably the first to be set up, with no direct hint that it also includes the usage of all the others.
- SLUB debugging/tracing/validation can be enabled either globally or for individual slab pools. However, this incurs overhead and is disabled by default.
  - Some of the debugging options are only supported when present from boot, via the boot option `slub_debug=<debug_options>:<slab_name>`.
  - To dynamically debug after boot, see the writable files under `/sys/kernel/slab/<slab name>/`: cpu_partial, min_partial, remote_node_defrag_ratio, shrink, validate
  - See section "Some more sophisticated uses of slub_debug" in the slub kernel docs: https://www.kernel.org/doc/Documentation/vm/slub.txt
Observability for slab allocations
Unintuitive gotchas in the slab reporting:
- Slab pool merging:
- Improves efficiency but makes interpretation harder.
- Combining N caches into 1 makes cache usage attribution impractical within a pool.
- Not all slab names are shown: only one name per merged cache is reported, which makes it harder to identify all relevant caches.
Tools list and notes
- `slabtop` utility
  - Top N slab pools, sorted by size by default.
  - Gotcha: Does not show all slab pools! It limits output to the terminal window size. The only way to increase N is to grow the terminal window.
  - Redirecting output to a file or pipe makes it worse -- shows only 23 rows. That bug is fixed in newer versions of package "procps-ng" (3.3.16), available on Ubuntu Focal (20.04) but not Bionic (18.04).
- `/proc/slabinfo`
  - Machine-parsable summary of slab pools.
  - Does not compute pool sizes (slab size * number of slabs).
  - Does include all slab pools.
  - Still shows only the first alias of merged slab pools. This makes it hard to infer the semantics of the pools' usage, since the SLUB allocator aggressively merges slab pools.
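Since `/proc/slabinfo` does not compute pool sizes, a small parser can do it. The rows below are illustrative, not captured from a real host, and the footprint estimate (`num_objs * objsize`) ignores per-slab padding:

```python
# Columns in /proc/slabinfo (after two header lines):
#   name <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : ...
sample = """\
slabinfo - version: 2.1
# name <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : ...
dentry 250000 260000 192 21 1 : tunables 0 0 0 : slabdata 12381 12381 0
kmalloc-128 5120 5120 128 32 1 : tunables 0 0 0 : slabdata 160 160 0
"""

sizes = {}
for line in sample.splitlines()[2:]:     # skip the two header lines
    fields = line.split()
    name, num_objs, objsize = fields[0], int(fields[2]), int(fields[3])
    sizes[name] = num_objs * objsize     # approximate pool footprint in bytes

# Sorting by computed size surfaces the biggest pools, like slabtop does,
# but without the terminal-size limitation.
biggest = max(sizes, key=sizes.get)
assert biggest == "dentry"
```

Running the same logic against the real file (`open("/proc/slabinfo")`, which typically requires root) gives a complete, scriptable ranking of slab usage.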
- `slabinfo` utility
  - A useful binary in the kernel source: `tools/vm/slabinfo.c`
  - Very flexible exploratory tool. Too bad it is not packaged conveniently.
  - Build: `make -C $LINUX_SRC/tools/vm slabinfo`
  - Can show all aliases.
- `/sys/kernel/slab/*`
  - Kernel API. Each slab pool has its own directory, and aliases are symlinks.
  - Supports enabling tracing per slab pool, although this is risky for very active slabs, since the debug tracing messages can have a very high rate.
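Those alias symlinks can be resolved to group merged caches. The sketch below simulates the `/sys/kernel/slab` layout in a temp directory so it needs neither root nor a live system; the pool and alias names are invented:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as slab_root:
    # One real pool directory, as /sys/kernel/slab would contain...
    pool = os.path.join(slab_root, ":a-0000128")   # illustrative pool name
    os.mkdir(pool)
    # ...plus merged caches represented as symlinks to that pool.
    for alias in ("kmalloc-128", "pid_namespace"):  # invented alias names
        os.symlink(pool, os.path.join(slab_root, alias))

    # Resolve every entry to its real target to see which names share a pool.
    targets = {name: os.path.realpath(os.path.join(slab_root, name))
               for name in os.listdir(slab_root)}
    merged = sorted(n for n, t in targets.items()
                    if t == os.path.realpath(pool))
    # The pool itself plus both aliases all resolve to the same directory.
    assert len(merged) == 3
```

Pointing the same resolution logic at the real `/sys/kernel/slab/` recovers the alias groupings that `/proc/slabinfo` hides.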