Incident Review: GitLab.com slow and not loading some elements

INC-4589: GitLab.com slow and not loading some elements

INC-4641: Patroni main SLI dropping

Generated by André Luís on 8 Oct 2025 14:12. All timestamps are local to Etc/UTC

Key Information

Metric                 | INC-4589                                    | INC-4641
Customers Affected     | GitLab.com users, 5 customer tickets        | GitLab.com users
Requests Affected      | 86% of all traffic dropped                  | 89% of all traffic dropped
Incident Severity      | Severity 1 (Critical)                       | Severity 1 (Critical)
Impact Start Time      | Tue, 07 Oct 2025 17:15:00 UTC               | Thu, 09 Oct 2025 17:13:00 UTC
Impact End Time        | Tue, 07 Oct 2025 17:30:00 UTC (15 minutes)  | Thu, 09 Oct 2025 17:33:00 UTC (18 minutes)
Total Duration         | 1 hour, 47 minutes                          | 1 hour, 4 minutes
Link to Incident Issue | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/20688 | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/20705

Summary

Problem: On October 7 and 9, GitLab.com experienced complete database saturation across the patroni-main cluster due to an exceptionally high volume of API requests from a scheduled CI pipeline querying member information for projects in a large organizational namespace.

Impact: Between 17:15 and 17:30 UTC on October 7 and between 17:13 and 17:33 UTC on October 9, GitLab.com was unavailable for many users, who saw elevated 502 errors; others experienced slow performance and page elements that failed to load. Five customer tickets were opened about the disruption.

Causes: The root cause was an expensive database query joining over a dozen tables that, when executed at high concurrency, created severe LWLock:BufferMapping contention and CPU saturation on replica nodes, eventually cascading to the primary database node. Contributing factors included insufficient rate limiting for this computationally expensive endpoint and a lack of query optimization for large organizational hierarchies. A single user triggered the high volume of requests through a scheduled pipeline.

Response strategy: For the first incident, the service recovered without direct intervention, but the root cause was not immediately apparent, and the internal investigation was still underway when the second incident occurred. The second incident was resolved by blocking the user who triggered the pipeline and communicating directly with them to disable the scheduled pipeline. Immediate actions include implementing endpoint-specific rate limiting for the affected APIs and reviewing rate limit configurations for high-volume consumers. Short-term corrective actions focus on optimizing the query in question to reduce join complexity, considering tiered rate limiting strategies that account for query cost, and implementing circuit breakers for pathological query patterns.

What went well?

  1. We quickly identified drops in traffic and were able to bring in several team members from different departments to investigate from multiple angles.
  2. The investigation after the initial incident was thorough and team members swarmed on identifying the root cause as well as corrective actions.
  3. Some team members noticed the pattern of the recurring pipeline right before the second incident, which meant we knew immediately what was happening and how to stop it.

What was difficult?

  1. We could see the symptoms of the incident everywhere in our observability tooling, but still had difficulty diagnosing the underlying problem.
  2. We reached out to the customer after the first incident but had trouble getting a response. We did not realize at the time that the traffic came from an automated pipeline, and the delay in communication allowed the second incident to happen.
  3. The endpoint and underlying query are normally performant, even under this load, but this organization's scale hit an edge case in which the query performed poorly.

Combined RCA

Root Cause

The incident was triggered by an exceptionally high volume of API requests to an endpoint that returns member information for projects within a large organizational namespace. A single user made approximately 4,000 requests per minute over a 10-minute period, querying member information across three projects.

The specific query executed by this endpoint proved particularly expensive for organizations of this scale due to:

  1. Complex Query Structure: The query joins 16 tables to aggregate member information across group hierarchies, shared groups, and project authorizations
  2. Nested Loop Amplification: The execution plan used nested loop joins that repeatedly accessed the same buffer pages for namespaces, group_group_links, and user_group_member_roles tables
  3. Lock Contention: Each query iteration acquired and released BufferMapping LWLocks hundreds of thousands of times, creating severe contention when multiple queries ran concurrently
  4. CPU Saturation: Individual queries consumed 7-8 seconds of CPU time, and the high concurrency level (4,000 req/min) saturated all available database resources (a back-of-the-envelope check follows this list)
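
To sanity-check the saturation claim, the reported figures can be combined directly; this is only a rough sketch, and the replica core count mentioned in the comment is an assumption, not the actual patroni-main sizing:

    # Back-of-the-envelope CPU demand from the figures reported above.
    requests_per_minute = 4_000        # reported request rate
    cpu_seconds_per_query = 7.5        # midpoint of the reported 7-8 s of CPU per query

    demand = requests_per_minute / 60 * cpu_seconds_per_query
    print(f"CPU demand: ~{demand:.0f} core-seconds of work per second of wall clock")
    # => ~500 fully busy cores' worth of work, assuming no caching or early termination.
    # Even a generously sized replica fleet (for example, a few hundred cores in total --
    # an assumed figure) cannot absorb this, so CPU saturation and queueing follow
    # directly from the request rate and per-query cost.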

Cascading Failure Mechanism

  1. Replica Saturation: The initial query load hit replica nodes, causing CPU saturation and LWLock:BufferMapping contention
  2. Replication Lag: As replicas became CPU-bound, replication lag increased
  3. Load Balancer Failover: Rails database load balancing detected slow replica responses and shifted read traffic to the primary node (illustrated in the sketch after this list)
  4. Primary Saturation: The primary node became overwhelmed with read traffic it wasn't designed to handle
  5. Connection Pool Exhaustion: PgBouncer connection pools saturated, starving other queries and causing widespread timeouts
  6. Platform Impact: All endpoints sharing the saturated connection pool experienced degraded performance
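
The failover step above is the key amplifier: once every replica looks unhealthy, all read traffic has nowhere to go but the primary. The following simplified read-routing sketch illustrates the mechanism only; it is not GitLab's actual load-balancing code, and the lag threshold and host names are assumptions:

    # Simplified read routing: when all replicas exceed the health threshold,
    # the only remaining target is the primary, concentrating load on it.
    from dataclasses import dataclass

    @dataclass
    class Host:
        name: str
        replication_lag_s: float  # seconds behind the primary

    MAX_LAG_S = 60.0  # assumed health threshold, not the production value

    def pick_read_host(replicas: list[Host], primary: Host) -> Host:
        healthy = [r for r in replicas if r.replication_lag_s < MAX_LAG_S]
        # With healthy replicas available, reads stay on them; once all replicas
        # are CPU-bound and lagging, every read lands on the primary.
        return min(healthy, key=lambda r: r.replication_lag_s) if healthy else primary

    replicas = [Host("replica-1", 180.0), Host("replica-2", 240.0)]
    print(pick_read_host(replicas, Host("primary", 0.0)).name)  # -> "primary"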

Contributing Factors

  1. Insufficient Rate Limiting: The current rate limiting configuration for this endpoint was insufficient to prevent a single user from generating enough load to saturate database resources
  2. Unoptimized Query: The query involved was not optimized for organizations with complex group hierarchies and large member counts
  3. Lack of Query-Specific Safeguards: No specific rate limits or query timeouts were in place for this expensive endpoint

Resolution

The first incident self-resolved as the requesting user's API calls completed their pagination cycle and traffic returned to normal levels. The second incident was resolved by blocking the offending user account and working with the customer to disable the scheduled pipeline.

Corrective Actions

Immediate:

  • Implement endpoint-specific rate limiting for the affected APIs
  • Review rate limit configurations for high-volume consumers

Short-term:

  • Optimize the affected query to reduce join complexity for large organizational hierarchies

Long-term:

  • Establish more granular, tiered rate limiting strategies that balance customer needs with platform protection
  • Review and optimize other endpoints that perform complex multi-table joins
  • Enhance monitoring and alerting for LWLock contention patterns
  • Implement circuit breakers for queries that exhibit pathological performance characteristics (see the sketch after this list)
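
As a sketch of the circuit-breaker idea above: this is not an existing GitLab component, and the thresholds, cool-down, and class name are hypothetical values chosen for illustration:

    # Minimal per-endpoint circuit breaker: trip after repeated slow or failed
    # queries, reject further work during a cool-down, then allow a trial request.
    import time

    class QueryCircuitBreaker:
        def __init__(self, failure_threshold=5, cooldown_s=30.0, slow_query_s=5.0):
            self.failure_threshold = failure_threshold  # consecutive bad queries before tripping
            self.cooldown_s = cooldown_s                # how long to reject work once open
            self.slow_query_s = slow_query_s            # duration treated as pathological
            self.failures = 0
            self.opened_at = None

        def allow(self) -> bool:
            if self.opened_at is None:
                return True
            if time.monotonic() - self.opened_at >= self.cooldown_s:
                self.opened_at = None                   # half-open: let one trial through
                self.failures = 0
                return True
            return False

        def record(self, duration_s: float, ok: bool) -> None:
            if ok and duration_s < self.slow_query_s:
                self.failures = 0
                return
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()       # open the breaker

A caller would check allow() before running the expensive query and pass the observed duration and outcome to record() afterwards, so a sustained pathological pattern sheds load instead of saturating the connection pool.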

Lessons Learned

  1. Rate Limiting Must Scale with Query Cost: Rate limits should account for the computational cost of operations, not just request volume. Expensive queries require more conservative limits (a minimal sketch follows this list).
  2. Query Complexity Compounds Under Load: Queries that join many tables can exhibit non-linear performance degradation under concurrent load due to lock contention.
  3. Database Load Balancing Can Amplify Issues: When replicas fail health checks, shifting all traffic to the primary can accelerate rather than mitigate incidents.
  4. Observability Matters: Seeing a problem doesn't mean being able to understand a problem. Observability needs to be designed to drive diagnosis in addition to detection.
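
A minimal illustration of cost-aware limiting, as referenced in lesson 1: the endpoint weights, bucket size, and names below are hypothetical, not proposed production values:

    # Token-bucket limiter where each request consumes tokens proportional to the
    # estimated cost of its endpoint, so expensive queries hit the limit sooner.
    import time

    ENDPOINT_COST = {"project_members": 20, "single_project": 1}  # hypothetical weights

    class CostAwareLimiter:
        def __init__(self, tokens_per_minute=600):
            self.capacity = tokens_per_minute
            self.tokens = float(tokens_per_minute)
            self.last_refill = time.monotonic()

        def allow(self, endpoint: str) -> bool:
            now = time.monotonic()
            # Refill continuously, capped at the bucket capacity.
            self.tokens = min(self.capacity,
                              self.tokens + (now - self.last_refill) * self.capacity / 60.0)
            self.last_refill = now
            cost = ENDPOINT_COST.get(endpoint, 1)
            if self.tokens >= cost:
                self.tokens -= cost
                return True
            return False

With weights like these, a caller hammering the expensive endpoint exhausts its budget roughly twenty times faster than one issuing cheap single-project requests, which is the behavior the tiered strategy is meant to capture.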

Investigation Details

Timeline - 2025-10-07 - INC-4589

Incident Timeline

  • 17:15 - Impact began
  • 17:23 - Incident reported by Cleveland Bledsoe Jr (initially Severity 2)
  • 17:28 - Escalated to Severity 1; status page updated; on-call teams pinged
  • 17:39 - Patroni Main Apdex showing recovery
  • 17:57 - Peak database request volume identified
  • 18:36 - Status page marked as resolved
  • 19:11 - Incident moved to post-incident documentation phase

Timeline - 2025-10-09 - INC-4641

Incident Timeline

  • 17:13 - Impact began
  • 17:14 - Incident reported by Stephanie Jackson (initially Severity 2)
  • 17:16 - Escalated to Severity 1
  • 17:19 - Status page updated; user traffic stopped but database still recovering
  • 17:23 - Offending user account blocked
  • 17:28 - Patroni showing recovery (not yet 100%)
  • 17:31 - Status page updated to "Monitoring" - issue mitigated
  • 17:46 - Root cause confirmed: a newly created scheduled CI pipeline
  • 18:12 - Status page marked as resolved
  • 18:19 - Incident moved to post-incident documentation phase

Investigation Notes

Internal investigation was conducted in https://gitlab.com/gitlab-org/gitlab/-/issues/575006 to ensure the privacy of customers and users.

Follow-ups

INC-4589

INC-4641


Process (this section to be removed once completed)

For the person opening the Incident Review

  • Set the title to Incident Review: (Incident issue name)
  • Assign a Service::* label (most likely matching the one on the incident issue)
  • Set a Severity::* label which matches the incident
  • In the Key Information section, make sure to include a link to the incident issue
  • Find and assign a DRI from the team which owns the service (check their Slack channel or assign the team's manager). The DRI for the incident review is the issue assignee.

For the assigned DRI

  • Fill in the remaining fields in the Key Information section, using the incident issue as a reference. Feel free to ask the EOC or other folks involved if anything is difficult to find.
  • If there are metrics showing Customers Affected or Requests Affected, link those metrics in those fields
  • Create a few short sentences in the Summary section summarizing what happened (TL;DR)
  • Link any corrective actions and describe any other actions or outcomes from the incident
  • Consider the implications for self-managed and Dedicated instances. For example, do any bug fixes need to be backported?
  • Once discussion wraps up in the comments, summarize any takeaways in the details section
  • Review the incident timeline for any sensitive information and ensure the description history is cleaned up if anything needs to be removed.
  • Have this issue reviewed by the Production Engineering Senior Manager or Infrastructure Platforms Director before making it public.
  • Close the review before the due date
  • Go back to the incident channel or page and close out the remaining post-incident tasks

Review Guidelines

This review should be completed by the team which owns the service causing the alert. That team has the most context around what caused the problem and what information will be needed for an effective fix. The EOC or IMOC may create this issue, but unless they are also on the service owning team, they should assign someone from that team as the DRI.

Edited by Steve Abrams