Understand lock_manager LWLock contention
LWLock saturation is an increasing risk for GitLab.com stability. We've seen replicas saturate this metric and queue queries; see gitlab-com/gl-infra/scalability#2301 (closed) for excellent background.
In short, when a transaction (or a single query outside a transaction) touches a combined total of more than 16 tables and indexes, it takes a slow path that acquires lock_manager lightweight locks. When a large number of queries take this slow path, they can back up and cause slow query response times. This has happened at least twice on replicas as of this writing, leading to significant performance degradation.
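As a rough illustration of the threshold described above, the following minimal sketch splits the relations locked by one transaction into fast-path and slow-path counts. It is a simplified model, not PostgreSQL's implementation: `split_locks` and `FAST_PATH_SLOTS` are hypothetical names, and the model assumes every lock is eligible for the fast path and that no slots are already occupied.

```python
# Simplified model of the 16-slot fast path described above.
# Assumption: every relation lock is fast-path eligible and all slots start free.
FAST_PATH_SLOTS = 16


def split_locks(relations_locked: int, slots: int = FAST_PATH_SLOTS) -> tuple[int, int]:
    """Return (fast_path, slow_path) lock counts for one transaction."""
    if relations_locked < 0:
        raise ValueError("relations_locked must be non-negative")
    fast = min(relations_locked, slots)   # locks that fit in the fast-path slots
    slow = relations_locked - fast        # the remainder needs lock_manager LWLocks
    return fast, slow


if __name__ == "__main__":
    for n in (10, 16, 17, 40):
        fast, slow = split_locks(n)
        print(f"{n} relations locked: {fast} fast-path, {slow} slow-path")
```

In this model, a transaction locking 17 relations pays for one slow-path lock, while one locking 40 pays for 24, which is why the number of tables and indexes touched matters beyond simply crossing the threshold.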
In order to reduce the contention rate for this resource, we need to understand a few things:
- Is the impact of a query static once it exceeds the 16 objects for fastpath locking, or is it proportional to the number of objects locked? - It is proportional to the number of objects locked.
- Which queries produce the most slowpath locking traffic, both on primaries and replicas? - Queries on namespaces contribute heavily to total lightweight locking time, based on analysis in https://ops.gitlab.net/gitlab-com/database-team/lwlock-analysis (internal only). The number of indexes on the namespaces table should be reduced to lower total lightweight lock saturation.
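The two answers above can be combined into a back-of-the-envelope ranking: if slow-path cost grows with the number of objects locked, then a query's contribution is roughly its slow-path lock count multiplied by how often it runs. The sketch below illustrates this under stated assumptions. The query names, index counts, and call rates are invented for illustration and are not taken from the lwlock-analysis project; the model also assumes every index on a referenced table is locked.

```python
from dataclasses import dataclass, replace

FAST_PATH_SLOTS = 16  # fast-path limit described in this issue


@dataclass(frozen=True)
class QueryShape:
    name: str             # hypothetical label, not a real GitLab query
    tables: int           # tables referenced by the query
    indexes: int          # total indexes on those tables (assumed all locked)
    calls_per_sec: float  # made-up call rate

    def slow_path_locks(self) -> int:
        # Locks that do not fit into the fast-path slots.
        return max(0, self.tables + self.indexes - FAST_PATH_SLOTS)

    def slow_path_rate(self) -> float:
        # Slow-path lock acquisitions per second for this query shape.
        return self.slow_path_locks() * self.calls_per_sec


def rank_by_slow_path(queries):
    """Sort query shapes by estimated slow-path lock traffic, highest first."""
    return sorted(queries, key=lambda q: q.slow_path_rate(), reverse=True)


# Entirely illustrative workload.
workload = [
    QueryShape("wide_report_join", tables=6, indexes=30, calls_per_sec=5),
    QueryShape("small_lookup", tables=1, indexes=4, calls_per_sec=2000),
    QueryShape("namespace_lookup", tables=1, indexes=25, calls_per_sec=500),
]

if __name__ == "__main__":
    for q in rank_by_slow_path(workload):
        print(f"{q.name}: {q.slow_path_rate():.0f} slow-path locks/sec")

    # Effect of dropping five indexes from the hypothetical hot table.
    trimmed = replace(workload[2], indexes=20)
    print(f"after trimming: {trimmed.slow_path_rate():.0f} slow-path locks/sec")
```

In this toy workload, a frequently run query against a single heavily indexed table outweighs a much wider but rare join, and removing indexes from that table lowers its estimated slow-path traffic proportionally. Real rankings would need measured lock counts and call rates.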
Edited by Simon Tomlinson