Reduce lock_manager lwlock contention by increasing client-facing replica count for main db
Summary
Proposal:
Add 1 more replica of the main db to reduce contention over lock_manager lwlocks.
Analysis of incident production#8673 (closed) has convinced me that our main db's replicas are probably hovering near a critical saturation point. Until that incident, we did not know how close we were to saturation. Now we know that lock contention over the lock_manager lwlocks can transition from a minor to a major bottleneck if we lose a single replica node or if a deploy changes the structure of a frequently run query in a way that increases the lock acquisition rate.
Adding another replica is the simplest remedy, and doing so will also let us gather useful data to help prevent and mitigate future incidents.
Risk assessment summary:
We know from the incident that losing 2 replicas definitely causes a major performance regression. I suspect the probability is high that losing 1 replica would also cause an incident.
I don't have a good way to estimate whether a given rails deployment is likely to similarly cause an incident by increasing the lock acquisition rate, but circumstantial evidence suggests that this has happened at least once in the last month.
Brief background
Every weekday, our production workload on the main db's replicas suffers from intermittent bursts of contention over the lock_manager lwlocks.
This particular kind of contention tends to escalate very quickly and stall many transactions, because this lwlock is usually acquired in exclusive mode. The more often it happens, the more it degrades query throughput.
To illustrate, this graph shows the last 2 weeks as a typical example of how many postgres backends concurrently stall due to this contention:
The above graph's gauge metric is captured only a few times per minute, so it represents a small fraction of the real number of stall events. This version of the graph shows the largest spike in each 10-minute interval, highlighting that this type of stall event can often escalate to stall most of the active backends.
Some amount of contention is normal, but stalls that affect many backends can have an outsized impact.
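As background, this gauge is presumably derived from periodically sampling wait_events, e.g. via pg_stat_activity. A minimal sketch of that kind of polling query (the wait_event name is reported as lock_manager on older Postgres versions and LockManager on PostgreSQL 13+):

```sql
-- Count backends currently stalled waiting on the lock_manager lwlock.
-- A point-in-time sample like this, taken a few times per minute,
-- necessarily undercounts the true number of stall events (as noted above).
SELECT count(*) AS stalled_backends
FROM pg_stat_activity
WHERE wait_event_type = 'LWLock'
  AND wait_event IN ('lock_manager', 'LockManager');
```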
In incident production#8673 (closed), raising the query rate per replica by 20% (from 50K to 60K queries per second) pushed this lock contention past its saturation point, changing it from a minor influence to a major bottleneck:
The tipping point is probably still somewhere between 50K and 60K queries per second per replica, but it can shift after any deploy. The factors influencing the tipping point are discussed in the research issue scalability#2301. The conclusions from that research suggest that we increase the replica count and observe how much the contention rate decreases.
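To put rough numbers on the headroom (an illustration only, assuming the read workload Q is spread evenly across the pool, and using the 8-replica fleet and the ~50K per-replica rate from the incident as the baseline), the per-replica query rate scales as Q/N:

```math
\frac{Q/9}{Q/8} = \frac{8}{9} \approx 0.89, \qquad \frac{Q/7}{Q/8} = \frac{8}{7} \approx 1.14
```

In other words, adding a 9th replica cuts the per-replica rate by roughly 11% (about 50K → 44K), while losing one of the 8 raises it by roughly 14% (about 50K → 57K), which is already inside the observed 50K–60K danger range.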
Will the machine-type upgrade also help?
Yes, the ongoing work to upgrade the machine-type for these VMs will also help. The newer CPUs' higher instruction throughput should reduce the mean duration of holding the lock_manager lwlocks, which in turn reduces the frequency and duration of contention.
However, it may not help much with the long-tail stall events where the lwlock takes thousands of microseconds to become available. Those long-tail stalls may be driven by the lock-holder getting context-switched off CPU while still holding a lock_manager lwlock. That kind of stall is less likely to occur when there are fewer runnable processes competing for CPU timeslices. Reducing the query rate per postgres node may help with that kind of stall; adding another replica to the pool is the simplest way to do that.
Do we need to add a replica even though we are upgrading the machine-type soon?
I would suggest provisioning 1 extra replica as soon as possible (prior to upgrading the machine-type), and later reevaluating the replica count after the machine-type upgrades are complete.
Adding a replica now (while contention is directly observable) will give us a few benefits:
- Reduces the imminent risk of another contention-induced incident. Hovering near the tipping point means we are at risk from any of several factors that influence the lock acquisition rate, including losing a db node, adding an index to a frequently queried table, adding a join to a frequently run query, etc.
- Confirms that we can use the existing metric for capacity planning. We cannot directly measure the lock acquisition rate, but we can and do measure a sample of its contention via polling for wait_events. We suspect this will be a more reliable predictor of impending risk than the other surrogate metrics we looked into. If increasing the replica count reduces the contention measured via wait_events, we can be confident this is a good signal (a rough before/after check is sketched after this list).
- Tests that this is an effective mitigation strategy. As the business and workload continue to grow, we are certain to hit this bottleneck again, and having a ready solution that does not require extensive analysis of application behavior should help avert what would otherwise be a crisis.
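One possible shape for that before/after check, sketched under the assumption that the polled stall counts are persisted somewhere queryable; the table lock_manager_stall_samples and the psql variable replica_added_at below are hypothetical placeholders:

```sql
-- Hypothetical sketch: assumes the sampled stall counts are stored in a table
-- lock_manager_stall_samples(sampled_at timestamptz, stalled_backends int), and
-- that the psql variable :replica_added_at holds the provisioning timestamp.
SELECT
  CASE WHEN sampled_at < :'replica_added_at'::timestamptz
       THEN 'before' ELSE 'after' END                             AS period,
  round(avg(stalled_backends), 2)                                 AS mean_stalled,
  percentile_cont(0.99) WITHIN GROUP (ORDER BY stalled_backends)  AS p99_stalled
FROM lock_manager_stall_samples
WHERE sampled_at BETWEEN :'replica_added_at'::timestamptz - interval '7 days'
                     AND :'replica_added_at'::timestamptz + interval '7 days'
GROUP BY 1
ORDER BY 1;
```

A clear drop in both the mean and the p99 in the 'after' row would support relying on this metric for future capacity planning.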
Related Incident(s)
Originating issue(s):
- production#8673 (closed) - Incident where reducing the replica count from 8 to 6 caused several hours of degraded performance. Spreading the workload among fewer replicas increased each node's query rate and lock acquisition rate. Contention over the lock_manager lwlocks caused severe enough starvation to throttle query throughput and degrade apdex.
- scalability#2301 - Deep-dive investigation of the contention that caused incident production#8673 (closed). That issue's research is what led to this proposal to increase the replica count as the simplest and cheapest preventative option.
Desired Outcome/Acceptance Criteria
- 1 additional client-facing replica of the main db is provisioned and serving production read traffic.
- The contention rate over the lock_manager lwlocks, as measured via wait_event sampling, is observably lower after the replica is added.
- The replica count is reevaluated once the machine-type upgrades are complete.
Associated Services
Service::Postgres, Service::Patroni
Corrective Action Issue Checklist
- Link the incident(s) this corrective action arose out of
- Give context for what problem this corrective action is trying to prevent from re-occurring
- Assign a severity label (this is the highest sev of related incidents, defaults to 'severity::4')
- Assign a priority (this will default to 'Reliability::P4')