
Reference notes on Linux kernel memory pressure accounting (PSI)

These notes started as an impromptu Slack discussion between @mkaeppler @pks-t and @msmiley. We learned some useful facts about the memory pressure accounting and how to interpret it in the context of our current production environment's workloads. The following notes summarize the bits that might be interesting/useful to other folks in the future.

Intro

When researching trends in memory pressure metrics, we typically need to interpret from context what factors may be influencing an abnormally high pressure measurement. High pressure suggests potential contention over a specific resource (cpu, memory, block IO). Such contention can potentially cause latency spikes for some processes on the affected host or cgroup. It can also potentially indirectly lead to changes in other resource usage patterns (e.g. memory pressure can erode the filesystem cache, leading to increased block IO for workloads whose cache hit rate drops).

In support of analyzing memory pressure spikes, here we summarize:

  • the kernel's model for computing its pressure metrics
  • the specific events and conditions that the kernel accounts as contributing to pressure

How might this help with root cause analysis?

Trending the pressure metrics tells us that one or more of the accounted conditions is exhibiting contention under the workload and system state at the time.

The pressure measurement alone does not tell us which code paths or events are contributing to that pressure, but knowing how the accounting works can guide our next steps in the analysis. For example, next steps may include:

  • review other related metrics
  • add dynamic instrumentation to suspected hot code paths to quantify the count and duration spent in a contended state (see methodology and demo: #1825 (comment 1033234488))
  • attempt to synthetically reproduce the conditions driving the pressure trend

Kernel's Pressure Stall Info (PSI) model

The model for how the pressure metrics are measured and aggregated is covered in detail in the kernel source: kernel/sched/psi.c.

Concisely summarized:

3 resources are accounted by the pressure stall model: cpu, memory, and io. (The same model applies to all 3, but in later sections of these notes we will focus specifically on memory.)

Pressure represents the percentage of wallclock time during which otherwise runnable tasks could not execute on a CPU due to resource contention.

The pressure metrics represent 2 levels of stall:

  • "SOME" stall means at least one task was delayed by a stall event.
  • "FULL" stall means at least one task was delayed by a stall event, and also no tasks were running.

From the code comments:

SOME = nr_delayed_tasks != 0
FULL = nr_delayed_tasks != 0 && nr_running_tasks == 0
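
A minimal sketch of those two conditions as a single classification, using made-up names that mirror the pseudocode above (illustrative only, not kernel code); it also makes explicit that a FULL stall is a stricter case of a SOME stall:

/* Illustrative only: classify the stall state from the task counts used
 * in the pseudocode above. A FULL stall is also a SOME stall. */
enum stall_state { STALL_NONE, STALL_SOME, STALL_FULL };

static enum stall_state classify_stall(unsigned int nr_delayed_tasks,
                                       unsigned int nr_running_tasks)
{
    if (nr_delayed_tasks == 0)
        return STALL_NONE;   /* nothing is waiting on the resource */
    if (nr_running_tasks == 0)
        return STALL_FULL;   /* every non-idle task is stalled */
    return STALL_SOME;       /* some tasks stalled, others still running */
}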

The amount of wallclock time spent in a stalled state is intended to represent a percentage of the potential work to be done. Consequently it is scaled by either the number of CPUs or the number of non-idle tasks, whichever is less.

From the code comments:

threads = min(nr_nonidle_tasks, nr_cpus) 
SOME    = min(nr_delayed_tasks / threads, 1)
FULL    = (threads - min(nr_running_tasks, threads)) / threads
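
To make the scaling concrete, here is a small userspace sketch with made-up counts (illustrative only, not kernel code): with 4 CPUs and 6 non-idle tasks, of which 2 are delayed and 4 are running, threads = min(6, 4) = 4, so SOME = min(2/4, 1) = 0.5 and FULL = (4 - min(4, 4)) / 4 = 0.

#include <stdio.h>

/* Illustrative only: apply the kernel's scaling formulas to example counts. */
static unsigned int min_u(unsigned int a, unsigned int b) { return a < b ? a : b; }

int main(void)
{
    unsigned int nr_cpus = 4, nr_nonidle_tasks = 6;
    unsigned int nr_delayed_tasks = 2, nr_running_tasks = 4;

    unsigned int threads = min_u(nr_nonidle_tasks, nr_cpus);

    double some = (double)nr_delayed_tasks / threads;
    if (some > 1.0)
        some = 1.0;
    double full = (double)(threads - min_u(nr_running_tasks, threads)) / threads;

    printf("SOME=%.2f FULL=%.2f\n", some, full); /* prints SOME=0.50 FULL=0.00 */
    return 0;
}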

Certain kernel code blocks are accounted as a stall event for the PSI metrics.

  • The stall events for memory pressure accounting begin and end with calls to: psi_memstall_enter and psi_memstall_leave.
  • The wallclock time spent in those blocks is tracked in per-CPU counters.
  • Those counters are periodically aggregated into the global metrics. The aggregation windows are 10, 60, and 300 seconds.
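
Those aggregates are exposed in /proc/pressure/memory as avg10, avg60, and avg300 percentages, plus a cumulative total stall time in microseconds. A minimal userspace sketch for reading them (assumes a PSI-enabled kernel; the line format follows Documentation/accounting/psi.rst):

#include <stdio.h>

/* Illustrative only: read the aggregated memory pressure averages.
 * Expected line format:
 *   some avg10=0.00 avg60=0.00 avg300=0.00 total=0
 *   full avg10=0.00 avg60=0.00 avg300=0.00 total=0
 */
int main(void)
{
    FILE *f = fopen("/proc/pressure/memory", "r");
    char kind[8];
    double avg10, avg60, avg300;
    unsigned long long total;

    if (!f) {
        perror("fopen");
        return 1;
    }
    while (fscanf(f, "%7s avg10=%lf avg60=%lf avg300=%lf total=%llu",
                  kind, &avg10, &avg60, &avg300, &total) == 5) {
        printf("%s: avg10=%.2f%% avg60=%.2f%% avg300=%.2f%% total=%llus\n",
               kind, avg10, avg60, avg300, total / 1000000ULL);
    }
    fclose(f);
    return 0;
}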

Here we see how the stall durations and non-idle time are tracked per CPU, then aggregated, and then converted into a percentage over the averaging window. From the code comments:

For each runqueue, we track:

    tSOME[cpu] = time(nr_delayed_tasks[cpu] != 0)
    tFULL[cpu] = time(nr_delayed_tasks[cpu] && !nr_running_tasks[cpu])
 tNONIDLE[cpu] = time(nr_nonidle_tasks[cpu] != 0)

and then periodically aggregate:

 tNONIDLE = sum(tNONIDLE[i])

    tSOME = sum(tSOME[i] * tNONIDLE[i]) / tNONIDLE
    tFULL = sum(tFULL[i] * tNONIDLE[i]) / tNONIDLE

    %SOME = tSOME / period
    %FULL = tFULL / period
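
A small userspace sketch of that aggregation step, with hypothetical per-CPU times (illustrative only, not kernel code): each CPU's stall time is weighted by its non-idle time, so a mostly idle CPU contributes little to the result.

#include <stdio.h>

/* Illustrative only: aggregate per-CPU SOME stall time into a single
 * percentage over the averaging window, weighting each CPU by its
 * non-idle time as in the pseudocode above. */
int main(void)
{
    /* Hypothetical times in milliseconds over a 10,000 ms window. */
    double t_some[]    = {  200.0,   50.0,   0.0, 0.0 }; /* per-CPU stall time */
    double t_nonidle[] = { 9000.0, 4000.0, 500.0, 0.0 }; /* per-CPU busy time  */
    double period = 10000.0;
    int ncpu = 4;

    double nonidle_sum = 0.0, weighted_some = 0.0;
    for (int i = 0; i < ncpu; i++) {
        nonidle_sum   += t_nonidle[i];
        weighted_some += t_some[i] * t_nonidle[i];
    }

    double t_some_agg = nonidle_sum ? weighted_some / nonidle_sum : 0.0;
    printf("%%SOME over window = %.2f%%\n", 100.0 * t_some_agg / period);
    return 0;
}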

What counts as a "memory stall", for purposes of the memory pressure calculation?

Briefly, it looks like these events are the only ones that count as a "memory stall" (based on an initial review of the call sites shown next):

  • background compaction (kcompactd) to defragment (coalesce) free pages of memory
  • direct (foreground) memory compaction or reclaim
  • memory reclaim due to memory cgroup being over limit (including normal compaction and reclaim, plus any throttling to slow down an aggressive allocator)
  • thrashing on page-cache pages that are part of the working set
  • throttling delay in a block IO cgroup when over budget and specifically cited as a memory stall (maybe used for tmpfs?)
  • reading a block back into memory when it is flagged as part of the working set

We can directly observe these events by instrumenting calls to the PSI memstall accounting function. Several of the above types of events are visible in the demo: #1825 (comment 1033234488)

[Attached icicle graph: psi_memstall.overall_latency_histogram_and_sum_by_stack.iciclegraph.file-70-stor-gprd.svg]

Supporting details

The per-CPU counters that underlie the aggregated metrics in /proc/pressure/memory are maintained by accounting certain code blocks as a memory stall.

Those code blocks are wrapped by calls to psi_memstall_enter and psi_memstall_leave.
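
The pattern at each call site has the same shape: enter the memstall accounting, do the work that may stall, then leave. A hedged sketch follows (the psi_memstall_enter/psi_memstall_leave signatures are as in v5.4's include/linux/psi.h; the work function is a placeholder):

#include <linux/psi.h>

/* Placeholder for reclaim, compaction, a throttling delay, or a
 * working-set read; stands in for the real work at each call site. */
static void do_potentially_stalling_work(void)
{
}

/* Sketch of the accounting pattern wrapping a memory-stall code block. */
static void example_stall_path(void)
{
    unsigned long pflags;

    psi_memstall_enter(&pflags);    /* this task now counts as delayed */
    do_potentially_stalling_work();
    psi_memstall_leave(&pflags);    /* stop counting; fold the time into per-CPU state */
}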

As of kernel v5.4, there are 8 call sites to that pair of accounting functions.

These are the call sites:

  • block/blk-cgroup.c: In blkcg_maybe_throttle_blkg, if a block cgroup is configured with a throttling delay and if requested to account this as a memory stall, then do so.
  • block/blk-core.c: In submit_bio, if the block IO request is a read op with the flag BIO_WORKINGSET, then account the time spent submitting the request as a memory stall.
  • mm/compaction.c: The kcompactd kernel thread accounts its entire call to kcompactd_do_work as a memory stall; defragmenting memory always counts. Migrating pages to defragment free space within each zone requires updating page tables. I suspect this is accounted as a stall state because it briefly blocks traversal of the affected page tables while the physical page migrations are occurring, which naturally must block virtual-to-physical address lookups. I have not yet dug into the kcompactd_do_work code; this is just a rationale for why it is accounted as a stalled state. For context, coalescing free pages as a background task makes it faster to satisfy future foreground memory allocation requests, particularly large multi-page requests; it is an optimistic background task aiming to avoid foreground stalls during future page faults.
  • mm/filemap.c: In wait_on_page_bit_common, if the requested page is currently locked, not up-to-date with the backing file block, and part of the working set, then treat this as thrashing and account it as a memstall.
  • mm/memcontrol.c: In mem_cgroup_handle_over_high, in the special case where a memory cgroup's limit is exceeded and reclaim cannot keep up, throttle the allocator by stalling before returning from reclaim. The penalty delay is proportional to the excess. That penalty is accounted as a memstall.
  • mm/page_alloc.c: In __alloc_pages_direct_compact, if a memory allocation request cannot be satisfied immediately, attempt foreground memory compaction (coalesce free pages) and account that time as a memstall.
  • mm/page_alloc.c: In __perform_reclaim, if a memory allocation request still cannot be satisfied after compaction, attempt memory reclaim (e.g. evict from page cache), and account that time as a memstall.
  • mm/vmscan.c: In try_to_free_mem_cgroup_pages, similar to direct reclaim, but within a memory cgroup. When the cgroup is over budget, the call to do_try_to_free_pages is accounted as a memstall.