2023-04-05: PatroniServiceRailsReplicaSqlApdexSLOViolation
Customer Impact
GitLab.com experienced slow response times for the Web and API services between 08:07 and 10:25 UTC on 2023-04-05.
Current Status
More information will be added as we investigate the issue. Customers who believe they are affected by this incident should subscribe to this issue or monitor our status page for further updates.
2023-04-11 08:22 UTC: The incident is mitigated. We reverted #8648 (closed) and continued to monitor the stability of the Patroni cluster. Investigation into the root cause is still ongoing.
Brief summary of findings
This apdex regression for database query latency was driven by the following causal chain:
- Contention over the `lock_manager` lwlock increased as the query rate per replica node increased. This contention increased query duration by delaying the acquisition of heavyweight locks on tables and indexes.
- This slowness limited query throughput, and as demand rose above capacity, the pgbouncer backend connection pools saturated.
- No one particular query, table, or application endpoint was associated with the performance regression, because most queries were stalling on the same shared resource: access to the shared-memory lock tables, guarded by the `lock_manager` lwlocks. Queries that required more locks may have been more prone to the effect, but a large variety of queries were affected by the contention. This is because most of our workload's queries (approximately 60-70%) need to acquire at least one heavyweight lock via the shared-memory lock table (i.e. slowpath locking).
- The performance regression was directly caused by the `lock_manager` lwlock contention throttling the transaction rate. As the contention rate increased, so did the severity. At times, hundreds of backends were concurrently waiting for one of the 16 `lock_manager` lwlocks (see the example query after this list).
- The contention led to increased query duration, which led to connection pool saturation.
- The proximal cause of the regression was reducing the number of replica nodes from 8 to 6, which increased the query rate per replica node. This capacity reduction was assumed to be safe, since a few weeks earlier 6 replicas had been sufficient.
- The lower-level root cause of the regression was that the rate per replica node of acquiring heavyweight locks via the slowpath had increased to the point where, during peak workload, many concurrent transactions frequently had to pause to wait for the `lock_manager` lwlock.
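This kind of stall is visible in pg_stat_activity as LWLock waits. The following is a minimal diagnostic sketch, not a query taken from the incident investigation; the wait event is named LockManager on PostgreSQL 13+ and lock_manager on earlier versions:

```sql
-- Sketch: count backends currently stalled on the lock manager lwlock.
-- Wait event is 'LockManager' on PostgreSQL 13+, 'lock_manager' on <= 12.
SELECT wait_event, count(*) AS waiting_backends
FROM pg_stat_activity
WHERE wait_event_type = 'LWLock'
  AND wait_event IN ('LockManager', 'lock_manager')
GROUP BY wait_event;
```

Sampling this periodically during peak load gives a rough signal of how often backends queue behind the 16 lock manager lwlock partitions.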
Running more queries per replica node is what pushed the lock acquisition rate past its tipping point, and re-adding replicas resolved the problem in the short term. As next steps, we also want to identify what factors contribute to the contention. See scalability#2301 (closed) for details.
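For rough intuition (assuming total query demand stayed roughly constant between the two configurations), removing 2 of the 8 replicas concentrates the same read traffic onto fewer nodes:

```math
\frac{q_{\text{total}}/6}{q_{\text{total}}/8} = \frac{8}{6} \approx 1.33
```

i.e. each remaining replica absorbed roughly 33% more queries per second, and slowpath lock acquisitions scaled with that rate until the `lock_manager` lwlocks became the bottleneck.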
Several things can increase the rate of `lock_manager` lwlock acquisitions per transaction. We do not know which of them occurred in the weeks since 6 replicas were last sufficient. Likely candidates include adding a join to a frequently run query, splitting a series of queries into multiple transactions, adding indexes to a frequently queried table, etc. But it doesn't matter what the last straw was; it probably isn't the largest contributor. What matters next is finding efficient ways to either reduce demand for this finite resource or add capacity to accommodate demand. Having uncovered this saturation point, we can now assess the major contributing factors and find optimization opportunities to reduce slowpath locking.
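One way to gauge how much of the workload depends on the shared-memory lock table, and therefore how much room there is to reduce slowpath locking, is to sample the fastpath flag in pg_locks. This is a hedged sketch using standard PostgreSQL catalogs, not a query from the investigation:

```sql
-- Sketch: share of currently granted relation locks that bypassed the
-- per-backend fastpath and went through the shared lock table (slowpath).
SELECT fastpath,
       count(*) AS granted_locks,
       round(100.0 * count(*) / sum(count(*)) OVER (), 1) AS pct
FROM pg_locks
WHERE locktype = 'relation'
  AND granted
GROUP BY fastpath;
```

Rows with fastpath = false are the ones contending for the `lock_manager` lwlocks; tracking that share over time would show whether optimizations (fewer joins in hot queries, fewer indexes on hot tables, consolidated transactions) are actually reducing slowpath demand.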
📝 Summary for CMOC notice / Exec summary:
- Customer Impact: Slow response times on Web and API services
- Service Impact: Service::Patroni, Service::Web
- Impact Duration (UTC): 08:07 - 10:25 (2hrs 18mins)
- Root cause: Contention over the `lock_manager` lwlocks
📚 References and helpful links
Recent Events (available internally only):
- Feature Flag Log - Chatops to toggle Feature Flags Documentation
- Infrastructure Configurations
- GCP Events (e.g. host failure)
Deployment Guidance
- Deployments Log | GitLab.com Latest Updates
- Reach out to Release Managers for S1/S2 incidents to discuss Rollbacks and/or Hot Patching | Rollback Runbook | Hot Patch Runbook
Use the following links to create related issues to this incident if additional work needs to be completed after it is resolved:
- Corrective action ❙ Infradev
- Incident Review ❙ Infra investigation followup
- Confidential Support contact ❙ QA investigation
Note: In some cases we need to redact information from public view. We only do this in a limited number of documented cases, laid out in our handbook page; this might include the summary, the timeline, or other details. Any such confidential data will be in a linked issue, visible only internally. By default, all information we can share will be public, in accordance with our transparency value.