Incident Review: GitLab.com slow and not loading some elements
INC-4589: GitLab.com slow and not loading some elements
INC-4641: Patroni main SLI dropping
Generated by André Luís on 8 Oct 2025 14:12. All timestamps are local to Etc/UTC
Key Information
| Metric | INC-4589 | INC-4641 |
|---|---|---|
| Customers Affected | GitLab.com users, 5 customer tickets | GitLab.com users |
| Requests Affected | 86% of all traffic dropped | 89% of all traffic dropped |
| Incident Severity | Severity 1 (Critical) | Severity 1 (Critical) |
| Impact Start Time | Tue, 07 Oct 2025 17:15:00 UTC | Thu, 09 Oct 2025 17:13:00 UTC |
| Impact End Time | Tue, 07 Oct 2025 17:30:00 UTC (15 minutes) | Thu, 09 Oct 2025 17:33:00 UTC (18 minutes) |
| Total Duration | 1 hour, 47 minutes | 1 hour, 4 minutes |
| Link to Incident Issue | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/20688 | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/20705 |
Summary
Problem: On October 7 and 9, GitLab.com experienced complete database saturation across the patroni-main cluster due to an exceptionally high volume of API requests from a scheduled CI pipeline querying large organizational projects.
Impact: Between 17:15 and 17:30 UTC on October 7 and between 17:13 and 17:33 UTC on October 9, GitLab.com was unavailable for many users, who saw elevated 502 errors; others experienced slow performance and page elements that failed to load. Five customer tickets were opened about the disruption.
Causes: The root cause was an expensive database query joining over a dozen tables that, when executed at high concurrency, created severe `LWLock:BufferMapping` contention and CPU saturation on replica nodes, eventually cascading to the primary database node. Contributing factors included insufficient rate limiting for this computationally expensive endpoint and a lack of query optimization for large organizational hierarchies. A single user triggered the high volume of requests through a scheduled pipeline.
Response strategy: For the first incident, service recovered without direct intervention but the root cause was not immediately apparent. Internal investigation was still underway when the second incident occurred. The second incident was resolved by blocking the user who triggered the pipeline and communicating directly with them to disable the scheduled pipeline. Immediate actions include implementing endpoint-specific rate limiting for affected APIs and reviewing rate limit configurations for high-volume consumers. Short-term corrective actions focus on optimizing the query in question to reduce join complexity, considering tiered rate limiting strategies that account for query cost, and implementing circuit breakers for pathological query patterns.
What went well?
- We quickly identified drops in traffic and were able to bring in several team members from different departments to investigate from multiple angles.
- The investigation after the initial incident was thorough and team members swarmed on identifying the root cause as well as corrective actions.
- Some team members noticed the pattern of the recurring pipeline right before the second incident, which allowed us to know immediately what was happening and how to approach stopping it.
What was difficult?
- We could see the symptoms of the incident everywhere in our observability tooling, but still had difficulty diagnosing the underlying problem.
- We reached out to the customer after the first incident but had trouble getting a response. We did not realize at the time that the requests came from an automated pipeline, and that delay in communication allowed the second incident to happen.
- The endpoint and underlying query in question are normally performant, even at this request volume, but this usage hit an edge case that caused the query to perform poorly.
Combined RCA
Root Cause
The incident was triggered by an exceptionally high volume of API requests to an endpoint for projects within a large organizational namespace. A single user made approximately 4,000 requests per minute over a 10-minute period, querying member information across three projects.
The specific query executed by this endpoint proved particularly expensive for organizations of this scale due to:
- Complex Query Structure: The query joins 16 tables to aggregate member information across group hierarchies, shared groups, and project authorizations
- Nested Loop Amplification: The execution plan used nested loop joins that repeatedly accessed the same buffer pages for the `namespaces`, `group_group_links`, and `user_group_member_roles` tables (see the diagnostic sketch after this list)
- Lock Contention: Each query iteration acquired and released `BufferMapping` LWLocks hundreds of thousands of times, creating severe contention when multiple queries ran concurrently
- CPU Saturation: Individual queries consumed 7-8 seconds of CPU time, and the high concurrency level (4,000 req/min) saturated all available database resources
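The nested-loop amplification above can be confirmed offline with PostgreSQL's own tooling. The sketch below is illustrative only: it assumes read-only access to a replica and substitutes a simplified placeholder query (the real endpoint joins 16 tables); the table names come from the analysis above, but the join columns, DSN, and query shape are assumptions.

```python
# Hedged diagnostic sketch: run EXPLAIN (ANALYZE, BUFFERS) on a simplified
# placeholder of the members query and print only the nested-loop and buffer
# lines. Very large "shared hit" counts under a Nested Loop node mean the same
# buffer pages are being latched over and over, which is what drives
# BufferMapping LWLock contention once many of these queries run concurrently.
import psycopg2

# Placeholder query: join columns are assumptions for illustration; the
# production query aggregates member information across 16 tables.
PLACEHOLDER_QUERY = """
SELECT members.user_id
FROM members
JOIN namespaces ON namespaces.id = members.source_id
LEFT JOIN group_group_links ON group_group_links.shared_group_id = namespaces.id
LEFT JOIN user_group_member_roles ON user_group_member_roles.group_id = namespaces.id
WHERE namespaces.id = %s
"""

def explain_with_buffers(dsn: str, namespace_id: int) -> None:
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute("EXPLAIN (ANALYZE, BUFFERS) " + PLACEHOLDER_QUERY, (namespace_id,))
        for (line,) in cur.fetchall():
            if "Nested Loop" in line or "Buffers:" in line:
                print(line)

if __name__ == "__main__":
    explain_with_buffers("dbname=gitlabhq_production", 1)  # assumed DSN and namespace id
```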
Cascading Failure Mechanism
- Replica Saturation: The initial query load hit replica nodes, causing CPU saturation and `LWLock:BufferMapping` contention (a monitoring sketch follows this list)
- Replication Lag: As replicas became CPU-bound, replication lag increased
- Load Balancer Failover: Rails database load balancing detected slow replica responses and shifted traffic to the primary node
- Primary Saturation: The primary node became overwhelmed with read traffic it wasn't designed to handle
- Connection Pool Exhaustion: PgBouncer connection pools saturated, starving other queries and causing widespread timeouts
- Platform Impact: All endpoints sharing the saturated connection pool experienced degraded performance
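As an illustration of how this contention is visible in PostgreSQL itself, the sketch below polls `pg_stat_activity` for backends waiting on the `BufferMapping` LWLock. The DSN, polling interval, and alert threshold are assumptions; GitLab.com's real detection is handled by its existing observability stack rather than a script like this.

```python
# Hedged monitoring sketch: count backends currently waiting on the
# BufferMapping LWLock. A sustained spike in this count on replicas is the
# contention signature described in the cascade above.
import time
import psycopg2

ALERT_THRESHOLD = 20  # assumed number of concurrent waiters worth alerting on
POLL_INTERVAL_S = 15  # assumed polling interval

WAITERS_SQL = """
SELECT count(*)
FROM pg_stat_activity
WHERE wait_event_type = 'LWLock'
  AND wait_event = 'BufferMapping'
"""

def poll(dsn: str) -> None:
    conn = psycopg2.connect(dsn)
    conn.autocommit = True
    while True:
        with conn.cursor() as cur:
            cur.execute(WAITERS_SQL)
            waiters = cur.fetchone()[0]
        if waiters >= ALERT_THRESHOLD:
            print(f"ALERT: {waiters} backends waiting on LWLock:BufferMapping")
        time.sleep(POLL_INTERVAL_S)

if __name__ == "__main__":
    poll("dbname=gitlabhq_production")  # assumed DSN
```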
Contributing Factors
- Insufficient Rate Limiting: The current rate limiting configuration for this endpoint was insufficient to prevent a single user from generating enough load to saturate database resources
- Query Optimization: The query involved was not optimized for organizations with complex group hierarchies and large member counts
- Lack of Query-Specific Safeguards: No specific rate limits or query timeouts were in place for this expensive endpoint
Resolution
The first incident self-resolved as the requesting user's API calls completed their pagination cycle and traffic returned to normal levels. The second incident was mitigated by blocking the offending user account and working with the customer to disable the scheduled pipeline.
Corrective Actions
Immediate:
- Review and adjust rate limiting thresholds for high-volume API consumers to better protect platform stability (https://gitlab.com/gitlab-com/gl-infra/production-engineering/-/issues/27762)
- Implement endpoint-specific rate limiting for the endpoint which caused the incident (https://gitlab.com/gitlab-com/gl-infra/production-engineering/-/issues/27764)
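As a sketch of what endpoint-specific rate limiting could look like, the snippet below keeps a fixed-window counter in Redis keyed by user and endpoint. The limit, window, key format, and endpoint name are illustrative assumptions, not the configuration being rolled out in the linked issues.

```python
# Hedged sketch of per-user, per-endpoint rate limiting using a Redis
# fixed-window counter. All thresholds and key names are illustrative.
import time
import redis

def allow_request(r: redis.Redis, user_id: int, endpoint: str,
                  limit: int = 300, window_s: int = 60) -> bool:
    """Return True if the caller is within `limit` requests per `window_s`."""
    window = int(time.time() // window_s)
    key = f"ratelimit:{endpoint}:{user_id}:{window}"
    pipe = r.pipeline()
    pipe.incr(key)
    pipe.expire(key, window_s)
    count, _ = pipe.execute()
    return count <= limit

if __name__ == "__main__":
    r = redis.Redis()  # assumes a local Redis instance
    # Hypothetical endpoint name: throttle the expensive members endpoint far
    # below the global API limit, since each request is costly on the database.
    if not allow_request(r, user_id=42, endpoint="group_projects_members"):
        print("429 Too Many Requests")
```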
Short-term:
- Optimize the affected query to reduce join complexity and buffer contention (https://gitlab.com/gitlab-org/gitlab/-/issues/576075)
- Introduce caching to improve query performance (gitlab-org/gitlab#513033)
- Consider breaking the complex query into multiple simpler statements or using CTEs to constrain join ordering (https://gitlab.com/gitlab-org/gitlab/-/issues/576075)
- Implement query-specific timeouts for expensive operations (https://gitlab.com/gitlab-com/gl-infra/production-engineering/-/issues/27765)
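The query-specific timeout item above could be enforced at several layers; as one illustration at the SQL session level, the sketch below (assuming direct psycopg2 access, with a placeholder timeout value) sets a tight `statement_timeout` for the expensive statement only, so a pathological plan is cancelled by PostgreSQL instead of consuming 7-8 seconds of CPU.

```python
# Hedged sketch: apply a per-statement timeout so one expensive query cannot
# monopolise a connection. SET LOCAL scopes the setting to the current
# transaction, so other statements keep the session default.
import psycopg2
from psycopg2 import errors

def run_with_timeout(dsn: str, sql: str, params: tuple, timeout_ms: int = 5000):
    conn = psycopg2.connect(dsn)
    try:
        with conn, conn.cursor() as cur:
            cur.execute("SET LOCAL statement_timeout = %s", (timeout_ms,))
            cur.execute(sql, params)
            return cur.fetchall()
    except errors.QueryCanceled:
        # PostgreSQL cancelled the statement: fail fast rather than saturate CPU.
        return None
    finally:
        conn.close()
```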
Long-term:
- Establish more granular, tiered rate limiting strategies that balance customer needs with platform protection
- Review and optimize other endpoints that perform complex multi-table joins
- Enhance monitoring and alerting for `LWLock` contention patterns
- Implement circuit breakers for queries that exhibit pathological performance characteristics (a sketch follows this list)
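The circuit-breaker idea above could start as simply as the sketch below: repeated slow executions open the circuit and callers fail fast until a cooldown elapses. The thresholds and half-open behaviour are illustrative assumptions, not an existing GitLab component.

```python
# Hedged sketch of a query circuit breaker: after `max_slow` consecutive slow
# executions the circuit opens and callers fail fast; once `cooldown_s` has
# elapsed, a probe request is allowed through to test recovery.
import time

class QueryCircuitBreaker:
    def __init__(self, slow_threshold_s: float = 5.0,
                 max_slow: int = 3, cooldown_s: float = 60.0):
        self.slow_threshold_s = slow_threshold_s
        self.max_slow = max_slow
        self.cooldown_s = cooldown_s
        self.slow_count = 0
        self.opened_at = None

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        # Half-open: allow a probe once the cooldown has elapsed.
        return time.monotonic() - self.opened_at >= self.cooldown_s

    def record(self, duration_s: float) -> None:
        if duration_s >= self.slow_threshold_s:
            self.slow_count += 1
            if self.slow_count >= self.max_slow:
                self.opened_at = time.monotonic()
        else:
            self.slow_count = 0
            self.opened_at = None
```

A caller would check `allow()` before issuing the expensive query and report the observed duration to `record()` afterwards.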
Lessons Learned
- Rate Limiting Must Scale with Query Cost: Rate limits should account for the computational cost of operations, not just request volume. Expensive queries require more conservative limits.
- Query Complexity Compounds Under Load: Queries that join many tables can exhibit non-linear performance degradation under concurrent load due to lock contention.
- Database Load Balancing Can Amplify Issues: When replicas fail health checks, shifting all traffic to the primary can accelerate rather than mitigate incidents.
- Observability Matters: Seeing a problem doesn't mean being able to understand a problem. Observability needs to be designed to drive diagnosis in addition to detection.
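To make the first lesson concrete, the sketch below weighs requests by an assumed per-endpoint cost instead of counting them equally: each user gets a budget of cost units per minute, so cheap endpoints stay generous while expensive ones are capped much lower. The cost table and budget are illustrative assumptions only.

```python
# Hedged sketch of cost-aware rate limiting: each endpoint consumes cost units
# proportional to how expensive it is on the database, so request volume alone
# no longer determines whether a caller is throttled.
import time
from collections import defaultdict

# Illustrative relative costs; a members-style endpoint is weighted heavily
# because a single call drives a many-table join.
ENDPOINT_COST = {
    "group_projects_members": 20,
    "project_list": 1,
}

class CostBudgetLimiter:
    def __init__(self, budget_per_minute: int = 1000):
        self.budget = budget_per_minute
        self.spent = defaultdict(int)  # (user_id, minute) -> cost units used

    def allow(self, user_id: int, endpoint: str) -> bool:
        minute = int(time.time() // 60)
        key = (user_id, minute)
        cost = ENDPOINT_COST.get(endpoint, 1)
        if self.spent[key] + cost > self.budget:
            return False
        self.spent[key] += cost
        return True
```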
Investigation Details
Timeline - 2025-10-07 - INC-4589
Incident Timeline
- 17:15 - Impact began
- 17:23 - Incident reported by Cleveland Bledsoe Jr (initially Severity 2)
- 17:28 - Escalated to Severity 1; status page updated; on-call teams pinged
- 17:39 - Patroni Main Apdex showing recovery
- 17:57 - Peak database request volume identified
- 18:36 - Status page marked as resolved
- 19:11 - Incident moved to post-incident documentation phase
Timeline - 2025-10-09 - INC-4641
Incident Timeline
- 17:13 - Impact began
- 17:14 - Incident reported by Stephanie Jackson (initially Severity 2)
- 17:16 - Escalated to Severity 1
- 17:19 - Status page updated; user traffic stopped but database still recovering
- 17:23 - Offending user account blocked
- 17:28 - Patroni showing recovery (not yet 100%)
- 17:31 - Status page updated to "Monitoring" - issue mitigated
- 17:46 - Confirmed root cause: new CI pipeline created
- 18:12 - Status page marked as resolved
- 18:19 - Incident moved to post-incident documentation phase
Investigation Notes
Internal investigation was conducted in https://gitlab.com/gitlab-org/gitlab/-/issues/575006 to ensure the privacy of customers and users.
Follow-ups
INC-4589
- Investigate the root cause of patroni-main behaviour during the S1 incident (Completed)
- Runners team to investigate runners behaviour during this incident
INC-4641
- Investigate the root cause of patroni-main behaviour during the S1 incident (Completed)
- Cache member roles assigned to user groups
- For endpoints that perform search functions there should be measures...
- Create rate limit for the specific endpoint in the incident
- Re-evaluate the rate limit configuration
- Tune the SQL query under the endpoint from this incident
Process (this section to be removed once completed)
For the person opening the Incident Review
- Set the title to `Incident Review: (Incident issue name)`
- Assign a `Service::*` label (most likely matching the one on the incident issue)
- Set a `Severity::*` label which matches the incident
- In the Key Information section, make sure to include a link to the incident issue
- Find and assign a DRI from the team which owns the service (check their Slack channel or assign the team's manager). The DRI for the incident review is the issue assignee.
For the assigned DRI
- Fill in the remaining fields in the Key Information section, using the incident issue as a reference. Feel free to ask the EOC or other folks involved if anything is difficult to find.
- If there are metrics showing Customers Affected or Requests Affected, link those metrics in those fields
- Create a few short sentences in the Summary section summarizing what happened (TL;DR)
- Link any corrective actions and describe any other actions or outcomes from the incident
- Consider the implications for self-managed and Dedicated instances. For example, do any bug fixes need to be backported?
- Once discussion wraps up in the comments, summarize any takeaways in the details section
- Review the incident timeline for any sensitive information and ensure the description history is cleaned up if anything needs to be removed.
- Have this issue reviewed by the Production Engineering Senior Manager or Infrastructure Platforms Director before making it public.
- Close the review before the due date
- Go back to the incident channel or page and close out the remaining post-incident tasks
Review Guidelines
This review should be completed by the team which owns the service causing the alert. That team has the most context around what caused the problem and what information will be needed for an effective fix. The EOC or IMOC may create this issue, but unless they are also on the service owning team, they should assign someone from that team as the DRI.