Investigate CPU saturation intermittently affecting main db primary node
Concern
The main db's primary node exhibits a significant amount of CPU scheduling delay.
This indicates CPU starvation is likely already occurring, even though the conventional average CPU utilization metric does not yet approach 100%.
This CPU scheduling delay occurs regularly:
- Increases during the peak hours of the normal weekday workload.
- Tends to spike at the start of each hour.
- Abruptly increases once CPU utilization starts exceeding 50%. (Might this relate to when cores have to start using both hyperthreads?)
To illustrate, the following graph shows a typical week, comparing active CPU time (yellow) versus CPU scheduling delay (green) on the main db's primary node. (For more details, see: #3803 (comment 2095155402).)
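For anyone who wants to reproduce the raw signal, here is a minimal sketch (illustrative only, not the pipeline behind the graph above). It assumes a Linux kernel exposing /proc/schedstat version 15 or later, where the 8th per-CPU field is the cumulative time tasks spent waiting to run, in nanoseconds; summing its delta across CPUs gives the "stalled seconds per wallclock second" unit used throughout this issue.

```python
# Minimal sketch: estimate aggregate CPU scheduling delay as
# "stalled seconds per wallclock second" from /proc/schedstat.
# Assumes schedstat version 15+, where the 8th per-CPU field is the
# cumulative run-queue wait time in nanoseconds.
import time

def total_runqueue_wait_ns() -> int:
    """Sum of run-queue wait time (ns) across all CPUs, cumulative since boot."""
    total = 0
    with open("/proc/schedstat") as f:
        for line in f:
            fields = line.split()
            if fields and fields[0].startswith("cpu"):
                total += int(fields[8])  # field 8: time spent waiting to run
    return total

def stall_seconds_per_second(interval_s: float = 5.0) -> float:
    """Stalled seconds accrued per wallclock second over one sampling interval."""
    before = total_runqueue_wait_ns()
    time.sleep(interval_s)
    after = total_runqueue_wait_ns()
    return (after - before) / 1e9 / interval_s

if __name__ == "__main__":
    while True:
        print(f"{stall_seconds_per_second():6.2f} stalled seconds per wallclock second")
```

Note that on a host with many vCPUs this ratio can greatly exceed 1.0, because many runnable tasks can be queued behind busy CPUs at the same time.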
This inferred CPU saturation may be a contributing factor in some recent incidents where a main db performance regression led to customer-facing apdex drops. Anecdotally, in at least 2 of those incidents, this specific db node became a performance bottleneck while it was suffering from CPU scheduling delay spikes. We do not yet know if this CPU starvation was a contributing cause or just a side-effect of those incidents.
What makes this important?
Hard to scale: Since this affects the db's primary node, we cannot simply add more nodes to gain capacity. Instead, to reduce saturation risk, we need to better understand what is driving the demand, so we can identify feasible ways to improve efficiency (probably through both infrastructure and application changes).
May amplify other forms of contention: This kind of delay can potentially indirectly exacerbate other forms of performance degradation. For example, if a process gets switched off of a CPU while it is still holding a contended LWLock, any scheduling delay for giving it another timeslice would increase the duration of the LWLock contention, blocking any other processes waiting for that LWLock. Small amounts of scheduling delay are normal and tolerable, but large amounts can adversely affect higher layers of resource management. (More generally, when a job has to wait for one contended resource while holding another, interaction effects between waiters can adversely affect their collective performance.)
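To make that interaction concrete, the following is a minimal, hypothetical sketch for checking whether postgres backends themselves are accumulating run-queue wait, i.e. are runnable but descheduled, which is exactly the condition that keeps a held LWLock held for longer. It reads /proc/<pid>/schedstat, whose three fields are on-CPU time (ns), run-queue wait time (ns), and timeslice count; the process-name filter and sampling interval are assumptions, not our actual tooling.

```python
# Minimal sketch: per-process scheduling delay for postgres backends,
# sampled from /proc/<pid>/schedstat (on-CPU ns, run-queue wait ns, timeslices).
import os
import time

def postgres_pids() -> list[int]:
    """PIDs whose command name is 'postgres' (assumption: default comm name)."""
    pids = []
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            with open(f"/proc/{entry}/comm") as f:
                if f.read().strip() == "postgres":
                    pids.append(int(entry))
        except OSError:
            pass  # process exited or access denied
    return pids

def runqueue_wait_ns(pid: int) -> int:
    with open(f"/proc/{pid}/schedstat") as f:
        _on_cpu_ns, wait_ns, _timeslices = f.read().split()
    return int(wait_ns)

def sample_backend_delays(interval_s: float = 5.0) -> dict[int, float]:
    """Per-PID run-queue wait (seconds) accumulated over the sampling interval."""
    before = {pid: runqueue_wait_ns(pid) for pid in postgres_pids()}
    time.sleep(interval_s)
    delays = {}
    for pid, start in before.items():
        try:
            delays[pid] = (runqueue_wait_ns(pid) - start) / 1e9
        except OSError:
            pass  # backend exited during the interval
    return delays

if __name__ == "__main__":
    worst = sorted(sample_backend_delays().items(), key=lambda kv: -kv[1])[:10]
    for pid, delay in worst:
        print(f"pid {pid}: {delay:.3f}s spent waiting to run during the last 5s")
```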
Capacity forecasting: This may warrant making a new saturation metric. We already have alerting for the CPU utilization metric's long-term growth trend, and that has been giving us useful advance warning. But this amount of CPU scheduling delay was not expected yet (as far as I know).
Scope of impact
Among the production database nodes, currently only the main db's primary node experiences significant CPU scheduling delay.
This suggests that a specific aspect of that role's workload is driving these spikes.
Goals
This issue will attempt to discover:
- What is consuming most of the CPU time on the main db's primary node? Specifically, look for what (if anything) distinctive occurs during spikes in CPU scheduling delay, when we suspect CPU usage is intermittently saturating.
- Are microbursts of CPU usage occurring, and are they driving the scheduling delay spikes? We already have some evidence of this from studying the non-peak workload (see production#18505 (comment 2094716776)), but we need to check whether those findings also match the larger and more erratic spikes during the peak weekday workload. (A minimal sampling sketch follows this list.)
- Identify some corrective actions. Consider a variety of options, maybe including: changing machine family or processor type, reevaluating hyperthread pros and cons, reducing query optimizer overhead for queries using `jsonb` fields, smoothing across time the workload that currently drives demand bursts, offloading the expensive queries to replicas, etc. The ongoing analysis should give us a clearer picture of the bottleneck, where its resources are spent, and which kinds of mitigations will be most feasible.
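As referenced in the microbursts goal above, here is a minimal sketch of the kind of sub-second sampling that can reveal CPU microbursts hidden by minute-level utilization averages. It assumes the standard Linux /proc/stat layout; the sampling duration, step, and 90% threshold are arbitrary placeholders.

```python
# Minimal sketch: sample whole-host CPU utilization at sub-second resolution
# from /proc/stat and compare the peaks against the mean, to expose microbursts
# that a 1-minute average would smooth away.
import time

def cpu_times() -> tuple[int, int]:
    """Return (busy_ticks, total_ticks) from the aggregate 'cpu' line."""
    with open("/proc/stat") as f:
        fields = [int(x) for x in f.readline().split()[1:]]
    total = sum(fields)
    idle = fields[3] + fields[4]  # idle + iowait
    return total - idle, total

def sample_utilization(duration_s: float = 60.0, step_s: float = 0.1) -> list[float]:
    """Per-step utilization (0.0-1.0) across all CPUs."""
    samples = []
    busy_prev, total_prev = cpu_times()
    end = time.monotonic() + duration_s
    while time.monotonic() < end:
        time.sleep(step_s)
        busy, total = cpu_times()
        if total > total_prev:
            samples.append((busy - busy_prev) / (total - total_prev))
        busy_prev, total_prev = busy, total
    return samples

if __name__ == "__main__":
    u = sample_utilization()
    print(f"mean={sum(u) / len(u):.2f}  max={max(u):.2f}  "
          f"samples at >=90% busy: {sum(s >= 0.9 for s in u)}")
```

A large gap between the mean and the max, or a cluster of samples near 100%, is the signature of microbursts that intermittently saturate the CPUs without moving the averaged utilization metric much.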
Results
Through excellent cross-team collaboration between group::scalability, group::database, and group::authorization, we achieved an outstanding efficiency improvement for one of our most crucial and chronically strained resources -- the main db's primary node.
Results highlights:
- Resolved the microbursts of CPU saturation and their associated stall condition: CPU scheduling delay spikes. Postgres now has much smoother CPU usage. Details: #3803 (comment 2186733088)
- Reduced 2 flavors of LWLock contention: `LockManager` and `BufferMapping`. Avoiding the scheduling delay spikes has removed one of the factors driving the long tail of these LWLocks' held duration. With more consistently short durations, contention occurs less often and resolves faster. Details: #3803 (comment 2193685647)
- Reduced peak CPU usage by over 40 vCPUs, regaining that capacity as headroom. This efficiency improvement will also help Dedicated, Cells, and self-managed customers, although the margin of improvement will likely be smaller than on gitlab.com due to differences in workload and data distribution. Details: #3803 (comment 2190069937)
Additionally, this discovery of inefficient query planning overhead was so significant (e.g. see #3803 (comment 2176872893)) that we intend to follow up with work to prevent a similar regression and to make it easier to discover related inefficiencies in the future.
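As a hedged illustration of how this kind of planner overhead can be surfaced (this is not the actual analysis or fix), the sketch below compares planning time to execution time using EXPLAIN (ANALYZE, FORMAT JSON). The connection string, table, and jsonb predicate are hypothetical, and psycopg2 is just one convenient driver.

```python
# Minimal sketch: measure how much of a query's cost is planner time vs.
# executor time, which is how a disproportionate planning overhead
# (e.g. on jsonb-heavy queries) can be spotted.
import json
import psycopg2

def plan_vs_execution_ms(dsn: str, query: str, params=None) -> tuple[float, float]:
    """Return (planning_ms, execution_ms) for the given query."""
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute("EXPLAIN (ANALYZE, FORMAT JSON) " + query, params)
        doc = cur.fetchone()[0]
        if isinstance(doc, str):  # some driver versions return raw JSON text
            doc = json.loads(doc)
        entry = doc[0]
        return entry["Planning Time"], entry["Execution Time"]

if __name__ == "__main__":
    planning, execution = plan_vs_execution_ms(
        "dbname=example",                                          # hypothetical DSN
        "SELECT * FROM example_table WHERE payload @> %s::jsonb",  # hypothetical query
        (json.dumps({"key": "value"}),),
    )
    print(f"planning: {planning:.1f} ms, execution: {execution:.1f} ms")
```

If planning time rivals or exceeds execution time for a frequently executed query, the planner itself is a meaningful share of the CPU demand and is worth investigating.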
As a quick visual aid, the graph below shows CPU usage, capacity, and scheduling delay on the main db's primary node. More details are in #3803 (comment 2186733088) and #3803 (comment 2190069937), but briefly, this graph illustrates the headroom we gained:
- On 2024-09-23, the hardware upgrade increased capacity by 38% -- adding 48 vCPUs, up from 128 to 176 vCPUs. This reduced CPU scheduling delay by roughly 50% (down from daily peaks of 30-60 stalled seconds per wallclock second).
- On 2024-10-21, the query efficiency improvement reduced demand by over 50% -- dropping peak usage from 80-120 vCPUs down to roughly 40 vCPUs. This eliminated the remaining CPU scheduling delay, dropping it to a trivial level (0.1 seconds of stall per wallclock second on a host with 176 vCPUs of capacity).
Lastly, a quick pitch for knowledge sharing:
For any future readers interested in learning more about this kind of research: this body of work combines elements from several related domains, including:
- the kernel's scheduling behaviors and native instrumentation, particularly for the initial discovery and extrapolating its significance
- postgres internals, particularly for locking behaviors and the factors affecting lock-held duration at the long tail
- ad hoc analysis and observability tooling, particularly for forming and testing hypotheses and augmenting existing metrics
- application logic, particularly for preserving semantics while avoiding the newly discovered source of overhead
If these topics interest you, read on, and reach out with any questions!

