Memory over-provisioning and OOM strategy for data stores
I stumbled across our memory over-provisioning as a root cause for failures during benchmarking and promised to open a separate ticket to discuss the general strategy.
Memory Management Summary
Most software cannot predict its exact memory needs at compile time, so memory is allocated dynamically at run time. When a program needs more memory it requests a new chunk from the operating system, which can either grant or deny the request. Much modern software does not pay close attention to how much it actually needs and requests considerably more than is currently necessary.
Operating systems accommodate this behavior by not exclusively reserving memory for a program when a request is granted. The OS simply hopes that the requestor will not actually use all of the granted memory, and may therefore promise far more memory than is physically available.
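On Linux this policy is visible in the kernel's overcommit settings and accounting. A minimal check, assuming a host with a /proc filesystem and the sysctl utility available:

    # Current overcommit policy: 0 = heuristic overcommit (default),
    # 1 = always overcommit, 2 = never promise more than the commit limit
    sysctl vm.overcommit_memory vm.overcommit_ratio

    # Memory the kernel has already promised (Committed_AS) versus the limit
    # it would enforce with vm.overcommit_memory = 2 (CommitLimit)
    grep -E 'CommitLimit|Committed_AS' /proc/meminfo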
If the software unexpectedly starts using the promised memory and would exceed the available resources, the OS has to act drastically. Potential actions are:
- Panic, halt the system
- Freeze the requesting program
- Terminate the requesting program
- Start an OOM-Killer that terminates memory-consuming processes
In most cases a so-called OOM-Killer is started that terminates processes according to predefined metrics. These could be:
- Memory consumption
- Arbitrary priority
- Combination of memory usage and priority
The common procedure is the following: when the kernel is not able to free enough memory in time, the OOM-Killer is started and kills the process with the highest score until the memory pressure is resolved. So if no tuning is done, the process consuming the most memory is the most likely to be killed.
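On Linux the score each process currently has can be inspected directly, and past OOM-Killer activity shows up in the kernel log. A small sketch, assuming a running process named postgres (the process name and the pgrep selection are only illustrative):

    # Badness score the kernel computed for the oldest postgres process
    # (higher score = more likely to be killed)
    cat /proc/$(pgrep -o postgres)/oom_score

    # User-controlled adjustment, from -1000 (never kill) to +1000 (prefer to kill)
    cat /proc/$(pgrep -o postgres)/oom_score_adj

    # Has the OOM-Killer already been active on this host?
    dmesg -T | grep -i 'out of memory'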
Example - Application Server
Think of an application server running hundreds of processes to serve hundreds of clients concurrently. Now a situation occurs in one of the processes that leads to massive memory consumption, and this process tries to use more memory than is available. The OOM-Killer is activated, will most likely kill the problematic process, and resolves the situation with minimal impact.
Example - Database Server
Think of a PostgreSQL database server that has hundreds of processes serving hundreds of simultaneous users. One or more users have a high memory consumption, and in sum all processes want to consume more memory than the OS can provide fast enough, so the OOM-Killer is started. Most likely one of the PostgreSQL processes executing queries is terminated. Unfortunately this leads to catastrophic failure: PostgreSQL processes access shared memory segments (shared_buffers). When one process is terminated without being able to "clean up", the postmaster cannot be sure that the shared memory is still consistent, and therefore the whole service has to restart immediately, aborting all queries and terminating all connections. After this unclean shutdown PostgreSQL needs to perform crash recovery on the next start, which can lead to long, unexpected downtimes.
Problem
When we experience strong memory pressure on one of the database servers, it is possible that the OOM-Killer terminates a PostgreSQL process. PostgreSQL will then shut down immediately. If this happens on the master it causes downtime.
Simple Solution
The solution proposed in the PostgreSQL documentation is quite simple: the memory overcommit functionality is only needed for software with loose memory management and is therefore deactivated on database servers. On Linux this is done by setting vm.overcommit_memory = 2 and tuning vm.overcommit_ratio. The kernel will then deny a memory request if it is obvious that not enough memory is available to fulfill it. PostgreSQL handles this gracefully: if a query needs more memory, new dynamic memory is allocated; if the allocation fails - due to the "don't overcommit" policy - the corresponding query is aborted and the involved memory is freed. So only one query from one user is affected.
This is the default and expected behavior for PostgreSQL and many other database management systems, as mentioned in the documentation.
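A minimal sketch of the corresponding kernel configuration; the file name and the ratio of 80 are only illustrative and have to be sized for the actual hosts:

    # /etc/sysctl.d/90-overcommit.conf (example file name)
    # Never promise more than swap + overcommit_ratio % of physical RAM
    vm.overcommit_memory = 2
    vm.overcommit_ratio = 80

    # Apply all sysctl configuration files and verify the resulting limit
    sysctl --system
    grep CommitLimit /proc/meminfo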
Alternative Solution
There are edge cases where disabling overcommit is not possible, for example because the database system is hosted on k8s, or because the system runs a mixed workload that includes software with loose memory management claiming excessive amounts of memory.
A mitigation is to configure the OOM-Killer to not base its decision on memory consumption alone, but to introduce an additional metric that decreases the chance of essential processes being terminated.
This can be done by setting oom_score_adj = -1000 for the PostgreSQL and Patroni processes, in our case via systemd's OOMScoreAdjust.
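A sketch of such a drop-in, assuming the service unit is called patroni.service (the unit name has to be adjusted to our actual setup):

    # systemctl edit patroni  -- creates a drop-in such as
    # /etc/systemd/system/patroni.service.d/override.conf containing:
    [Service]
    # -1000 exempts the unit's processes from the OOM-Killer entirely
    OOMScoreAdjust=-1000

    # Reload units and restart the service so the setting takes effect
    systemctl daemon-reload
    systemctl restart patroni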