Investigate intermittent saturation of lock_manager lwlocks in main db replicas
Goal
Improve our understanding of what factors drive bursts of contention over the `lock_manager` lightweight locks in the `main` db's replicas.
The factors are likely the same in other contexts, but we are observing such bursts specifically in this workload: read-only queries to the `main` db's replicas.
We focus specifically on the `lock_manager` type of lwlock because it featured prominently in a recent incident production#8673 (closed), as noted below.
Why this is important
Severe lightweight lock contention may signal that we are approaching a saturation point.
This research began as incident support, but it grew into a broader set of concerns:
- This appears to be an unanticipated saturation point.
- It is not currently part of our capacity forecasting. Reducing the number of active replica dbs was expected to be safe, but it was not.
- The utilization of this resource cannot easily be directly measured due to instrumentation overhead. To make a saturation metric, we will need to devise an indirect measurement approach, probably using a cheaper surrogate metric. If we cannot find such a metric, a fallback would be to treat the actual contention events as a signal, but as a lagging indicator.
- This saturation has caused at least one recent incident (and likely more).
- Future incidents involving this saturation point may potentially be triggered by otherwise benign changes (e.g. adding new indexes to frequently queried tables, partitioning large tables, adding queries as part of feature development that coincidentally happen to push the cumulative utilization past the tipping point).
- Understanding the factors driving this resource contention may influence architectural decisions (e.g. table partitioning, machine-type choice, pace of moving queries off the primary to the replicas, minimum replica count as a function of query rate).
Background and motivation
During incident production#8673 (closed), we incurred several hours of severe and unexpected query performance regression on some but not all replicas of the `main` db. This slowness led to saturating the affected replicas' pgbouncer backend pools:
That connection pool saturation caused the usual cascade of SLO violations for user-facing apdex and error rates.
Why did those pgbouncer backend connection pools saturate?
The precipitating event was reducing the number of active client-facing replica dbs from 8 to 6. Spreading the workload among fewer replicas pushed up the query rate for each replica (see production#8673 (comment 1349082533)). This was expected to be safe -- only a few weeks earlier, there had been only 6 active replicas, and the overall query rate had not increased significantly. On the day of this example incident (2023-04-05), the query rate per active replica peaked at 60K queries/second. That query rate is not unprecedented, but in this case it led to an unexpected spike in the mean query duration (see production#8673 (comment 1349076788)):
Mean query duration spiked on affected replicas (most severely on `patroni-main-2004-03`):
LWLock contention
Those spikes in query duration (and the resulting pgbouncer backend connection pool saturation) correlate with spikes in queries stalled waiting for a `lock_manager` lightweight lock:
The above graph shows examples captured by the marginalia sampler, which captures point-in-time snapshots of the state of concurrently running queries and records any wait-event that each query is currently stalled on.
Lightweight locks are normally acquired and released on a timescale of microseconds, so the fact that we observe contention at all suggests it is pretty severe.
In particular, this incident's most contended lwlock was the tranche of `lock_manager` lwlocks. They mediate acquiring the normal heavyweight locks on tables and indexes. Virtually every query needs to be able to acquire those lwlocks. And there are only 16 of them, each guarding the shared data structures for the heavyweight locks on 1/16th of the database's tables and indexes.
While any one backend is holding a `lock_manager` lwlock, if it stalls for any reason, that can quickly escalate to blocking many other concurrent transactions. This tendency for the contention to cascade rapidly explains why the graph above tends to show either no contention or hundreds of affected backends waiting for this type of lwlock.
If the holder of a `lock_manager` lwlock stalls for a few microseconds, that is ok and normal. But stalling for hundreds or thousands of microseconds is long enough for other transactions to need the same lwlock. Since the waiters each require this lwlock in exclusive mode, they all have to wait for the current holder to release it. This snowballs quickly when the lwlock is held for long enough to approach the mean query duration -- at that point it is very likely for many other transactions to stall waiting for that lwlock to become available.
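For an ad-hoc view of this contention outside the marginalia sampler, a similar point-in-time snapshot can be taken directly from `pg_stat_activity`. A minimal sketch; note the wait-event name differs between PostgreSQL major versions:

```sql
-- Snapshot of backends currently stalled waiting for a lock_manager lwlock.
-- The wait-event is named 'lock_manager' on PostgreSQL 12 and earlier,
-- 'LockManager' on PostgreSQL 13+; adjust to the running major version.
SELECT wait_event,
       count(*) AS backends_waiting
FROM pg_stat_activity
WHERE wait_event_type = 'LWLock'
  AND wait_event IN ('lock_manager', 'LockManager')
GROUP BY wait_event;
```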
Trend of stalls due to `lock_manager` LWLock contention
The incident on 2023-04-05 was the most severe contention in the last several weeks, but it is not unprecedented.
We may be hovering near saturation on normal days. So this is somewhat urgent to understand.
We know one sure mitigation is to add replica dbs, spreading the workload among more postgres instances. As a short-term mitigation, that can avoid incidents. But we also want to address capacity planning and long-term plans like table partitioning (which will greatly help other things but may increase this particular contention).
For context, here is the 2-week trend of stalls due to these `lock_manager` lwlocks being unavailable:
Why did the max query rate per replica go down compared to a couple months ago?
Query rate is only an approximation of lock acquisition rate.
Based on analysis of the locking code, the simplest explanation is that changes to our queries and/or schema have increased the average locks per query, such that `lock_manager` lwlock contention begins throttling query throughput sooner.
More details are in the next sections.
What influences `lock_manager` lwlock acquisition rate?
The acquisition rate of `lock_manager` lwlocks is effectively the number of relations per second that both:
- cannot be locked via fastpath
- have not already been locked earlier in the same transaction
So to reduce contention on the `lock_manager` lwlocks, we need to reduce the slowpath acquisition rate of heavyweight locks per postgres instance. This can be done via any combination of:
- Add more replicas to reduce the query rate per postgres instance.
  - This option adds capacity; all other options aim to reduce utilization where practical.
- Reduce query rate.
  - Can use any of the usual methods (e.g. results caching, shifting queries to the more scalable fleet of replicas, etc.).
- Increase the likelihood that frequently run queries can use fastpath locking:
  - Let very frequent queries run in their own transaction, if possible. Each transaction only gets up to 16 fastpath locks; any locks above that count must use the slowpath.
    - Caveat: Splitting a series of queries into separate transactions also means potentially running them on different replicas. If the application semantics require a strong guarantee that the 2nd query is not seeing an earlier point in the transaction stream than the 1st query, then keep them together in a single transaction to ensure they always run on the same replica.
  - Minimize the number of distinct tables joined by very frequently run queries. Each joined table and each of its indexes requires a separately acquired heavyweight lock.
  - Avoid extra indexes on very frequently queried tables. Currently 17 tables have more than 16 indexes, so every query involving those tables will have to use slowpath locking for at least some locks (see the catalog query sketched after this list).
    - Example: The `projects` table has 57 indexes, so a transaction needs 58 locks to query it, of which at most 16 can use fastpath.
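As a starting point for finding such tables, a catalog query along these lines could list tables whose index count alone fills the fastpath slots (a sketch; the threshold and relkinds may need tuning for our schema):

```sql
-- Tables with 16 or more indexes: the table's own lock plus its index locks
-- already exceeds the 16 fastpath slots available per backend.
SELECT i.indrelid::regclass AS table_name,
       count(*)             AS index_count
FROM pg_index i
GROUP BY i.indrelid
HAVING count(*) >= 16
ORDER BY index_count DESC;
```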
Background on locking internals and how fastpath works
The `lock_manager` lightweight lock guards access to the shared memory data structures for tracking which heavyweight locks are held by which transactions.
"Heavyweight" locks are the normal locks on tables and indexes. They are held for the remaining lifespan of a transaction (milliseconds or longer).
In contrast, "lightweight" locks (lwlocks) like `lock_manager` are typically held for only a few microseconds at a time. They provide concurrency control for backends executing short critical sections of code, such as manipulating a shared memory data structure.
To acquire a heavyweight lock, the postgres backend running the current transaction has two options: a special case fastpath or a general case slowpath. The slowpath is where our lwlock contention occurs, so let's walk through it first.
The "slowpath" has to update a shared-memory lock table. This means briefly acquiring the `lock_manager` lwlock in exclusive mode to perform that update. If the `lock_manager` lwlocks are under heavy contention, this slowpath can stall and starve for long enough to throttle query throughput -- which leads to poor performance and apdex degradation.
The slowpath described above can be skipped if certain "fastpath" conditions are met.
The "fastpath" heavyweight locking allows the backend to try to acquire a heavyweight lock without updating the shared memory hash table. Prerequisites include:
- The requested lock type is `relation` (i.e. table, index, view, etc.).
- The requested lock mode is "weak" (less than `ShareUpdateExclusiveLock`), which includes: `AccessShareLock`, `RowShareLock`, `RowExclusiveLock`.
- No other transaction has already locked the same relation in a "strong" mode (greater than `ShareUpdateExclusiveLock`), which includes: `ShareLock`, `ShareRowExclusiveLock`, `ExclusiveLock`, `AccessExclusiveLock`.
- Less than 16 fastpath locks are already held by this transaction (`FP_LOCK_SLOTS_PER_BACKEND` = 16).
Each transaction only gets to use the fastpath for up to 16 heavyweight locks; all additional locks must use the slowpath.
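The `fastpath` column of `pg_locks` makes this limit visible. A minimal sketch, run inside a throwaway transaction on a replica and using the `projects` table mentioned above as an example:

```sql
BEGIN;

-- Planning this query locks the projects table and all of its indexes.
SELECT count(*) FROM projects WHERE id = 1;

-- Inspect which of those relation locks got a fastpath slot (at most 16)
-- and which had to go through the slowpath (fastpath = false).
SELECT relation::regclass AS relation,
       mode,
       fastpath
FROM pg_locks
WHERE pid = pg_backend_pid()
  AND locktype = 'relation'
ORDER BY fastpath DESC, relation;

ROLLBACK;
```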
Our production workload's locking characteristics
Judging from a 2-hour survey, our production workload on the read-only replica nodes meets the first 3 prerequisites most of the time. The last point appears to be what usually causes transactions to have to use slowpath locking. (This result is not surprising, but it was important to verify.)
Nearly 70% of our heavyweight locks are acquired via slowpath, affecting nearly 60% of our transactions. These transactions need more than 16 heavyweight locks, so only a portion of locking can use fastpath.
Sometimes this is due to running multiple queries on different tables in the same transaction. Sometimes it is due to a single table having 16 or more indexes (all of which must be locked by the planner when choosing which execution plan to use).
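A rough way to reproduce the headline numbers is to snapshot the fastpath/slowpath split of currently held relation locks; the 2-hour survey presumably repeated a sample along these lines many times (the exact methodology is not shown here):

```sql
-- Instance-wide, point-in-time split of relation locks by acquisition path.
SELECT fastpath,
       count(*) AS relation_locks,
       round(100.0 * count(*) / sum(count(*)) OVER (), 1) AS pct
FROM pg_locks
WHERE locktype = 'relation'
GROUP BY fastpath;
```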
Generally we do not expect engineers to evaluate locking overhead when designing queries or schema changes. Usually other optimization goals are more important. We are focusing on this contention now only because it is effectively acting as a saturation point.
Adding capacity is the simplest remedy. Beyond that, we could try to improve efficiency by reviewing the most frequently run queries to see if there are opportunities to reduce their required slowpath lock count.
Conclusions
The two biggest factors driving `lock_manager` lwlock contention in our workload are:
- query rate per postgres instance
- proportion of the query's heavyweight locks (i.e. normal locks on tables and indexes) that are acquired via slowpath rather than fastpath
The 1st factor is influenced by:
- node count: Most read-only queries can be spread across a horizontally scalable pool of active replicas. However, this is not an option for queries that must run on the primary node (which includes all writes and certain reads that cannot tolerate staleness from async replication lag).
- app changes: Application changes can add db queries (e.g. for feature changes) or remove queries (e.g. by caching results).
- queries per transaction: App changes that split a transaction to avoid holding locks during a potentially long delay will have to reacquire those locks for the 2nd transaction. This avoids long-lived transactions (good) but increases the slowpath lock acquisition rate (bad). This trade-off is almost always worthwhile; just acknowledging it here as a contributing factor.
The 2nd factor is influenced by:
- index count on frequently queried tables: All indexes on an accessed table must be locked.
- join count in frequently run queries: All joined tables must be locked.
- queries per transaction: Tables already locked in the same transaction do not incur additional locking overhead for subsequent queries, but long transactions are harmful to throughput in that they prevent pgbouncer from returning the backend to the pool for other clients to use.
- table partitioning: Each accessed partition and its indexes must be individually locked. If any queries do not include a filter on the partition key, such queries will acquire locks on all partitions. So partition pruning is crucial to avoid increasing the rate of slowpath lock acquisition.
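To illustrate the partition-pruning point, here is a sketch against a hypothetical table `events` range-partitioned by `created_at` (table and column names are placeholders). On recent PostgreSQL versions, partitions pruned at plan time are not opened or locked:

```sql
BEGIN;

-- With a filter on the partition key, plan-time pruning means only the
-- matching partitions (and their indexes) are locked during planning.
EXPLAIN SELECT count(*) FROM events
WHERE created_at >= date '2023-04-01' AND created_at < date '2023-04-02';

-- Relation locks taken during planning are held until end of transaction,
-- so they are visible here for this backend. Repeating the experiment
-- without the partition-key filter locks every partition and each
-- partition's indexes, raising the slowpath lock acquisition rate.
SELECT count(*) AS relation_locks_held
FROM pg_locks
WHERE pid = pg_backend_pid()
  AND locktype = 'relation';

ROLLBACK;
```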
Preventative actions
- https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/23490 - Add at least 1 more replica node.
  - Helps by reducing the daily level of `lock_manager` contention. This is measurable via our existing wait_event sampling.
  - Helps by reducing the likelihood of another performance regression incident if any of the existing replica nodes fails. Currently I suspect losing even one replica would push the query rate and contention level high enough to throttle query throughput and cause an incident. For resilience, we generally want at least N+1 (ideally N+2).
  - Does not help with the same risks on the primary db node. To reduce that risk, we need additional mitigations.
- Treat `lock_manager` lwlock contention as a saturation metric.
  - Ideally we would like to have a metric that we can treat as a percent utilization. However, currently we have no way to efficiently directly measure `lock_manager` acquisition rate and held duration, which could answer the question of what percentage of wallclock time those lwlocks are held.
  - In lieu of that, we could use surrogate metrics like query rate, but that alone is not a reliable predictor of saturation.
  - However, with wait_event sampling, we get a preview of how much contention is actually occurring. Since a moderate level of contention is tolerable, we can treat it as a precursor hinting at more severe (and impactful) contention. That will hopefully give us enough advance warning to take one of the preventative actions:
    - Add another replica.
    - Search for frequently run queries that don't currently get to use fastpath locking.
    - Search for indexes to drop from frequently queried tables.
    - Ensure partitioned tables are queried in a way that is compatible with partition pruning.