Incident Review: GitLab.com slow and not loading some elements

INC-4589: GitLab.com slow and not loading some elements

INC-4641: Patroni main SLI dropping

Generated by André Luís on 8 Oct 2025 14:12. All timestamps are local to Etc/UTC

Key Information

Metric                 | INC-4589                                    | INC-4641
Customers Affected     | GitLab.com users, 5 customer tickets        | GitLab.com users
Requests Affected      | 86% of all traffic dropped                  | 89% of all traffic dropped
Incident Severity      | Severity 1 (Critical)                       | Severity 1 (Critical)
Impact Start Time      | Tue, 07 Oct 2025 17:15:00 UTC               | Thu, 09 Oct 2025 17:13:00 UTC
Impact End Time        | Tue, 07 Oct 2025 17:30:00 UTC (15 minutes)  | Thu, 09 Oct 2025 17:33:00 UTC (18 minutes)
Total Duration         | 1 hour, 47 minutes                          | 1 hour, 4 minutes
Link to Incident Issue | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/20688 | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/20705

Summary

Problem: On October 7 and 9, GitLab.com experienced complete database saturation across the patroni-main cluster due to an exceptionally high volume of API requests from a scheduled CI pipeline querying member information for projects in a large organizational namespace.

Impact: Between 17:15 and 17:30 UTC on October 7 and between 17:13 and 17:33 UTC on October 9, GitLab.com was unavailable for many users, who saw elevated 502 errors; others experienced slow performance and page elements that failed to load. Five customer tickets were opened about the disruption.

Causes: The root cause was an expensive database query joining over a dozen tables that, when executed at high concurrency, created severe LWLock:BufferMapping contention and CPU saturation on replica nodes, eventually cascading to the primary database node. Contributing factors included insufficient rate limiting for this computationally expensive endpoint and a lack of query optimization for large organizational hierarchies. A single user triggered the high volume of requests through a scheduled pipeline.

Response strategy: For the first incident, the service recovered without direct intervention, but the root cause was not immediately apparent, and the internal investigation was still underway when the second incident occurred. The second incident was resolved by blocking the user who triggered the pipeline and communicating directly with them to disable the scheduled pipeline. Immediate actions include implementing endpoint-specific rate limiting for the affected APIs and reviewing rate limit configurations for high-volume consumers. Short-term corrective actions focus on optimizing the query in question to reduce join complexity, considering tiered rate limiting strategies that account for query cost, and implementing circuit breakers for pathological query patterns.

What went well?

  1. We quickly identified drops in traffic and were able to bring in several team members from different departments to investigate from multiple angles.
  2. The investigation after the initial incident was thorough and team members swarmed on identifying the root cause as well as corrective actions.
  3. Some team members noticed the pattern of the recurring pipeline right before the second incident, which meant we knew immediately what was happening and how to stop it.

What was difficult?

  1. We could see the symptoms of the incident everywhere in our observability tooling, but still had difficulty diagnosing the underlying problem.
  2. We reached out to the customer after the first incident but had trouble getting a response. We did not realize at the time that the traffic came from an automated pipeline, and the delay in communication allowed the second incident to happen.
  3. The endpoint and underlying query are normally performant, even under this load, but this organization's scale hit an edge case in which the query performed poorly.

Combined RCA

Root Cause

The incident was triggered by an exceptionally high volume of API requests to an endpoint that returns member information for projects within a large organizational namespace. A single user made approximately 4,000 requests per minute over a 10-minute period, querying member information across three projects.

The specific query executed by this endpoint proved particularly expensive for organizations of this scale due to:

  1. Complex Query Structure: The query joins 16 tables to aggregate member information across group hierarchies, shared groups, and project authorizations
  2. Nested Loop Amplification: The execution plan used nested loop joins that repeatedly accessed the same buffer pages for namespaces, group_group_links, and user_group_member_roles tables
  3. Lock Contention: Each query iteration acquired and released BufferMapping LWLocks hundreds of thousands of times, creating severe contention when multiple queries ran concurrently
  4. CPU Saturation: Individual queries consumed 7-8 seconds of CPU time, and the high concurrency level (4,000 req/min) saturated all available database resources (a back-of-the-envelope check follows this list)
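
To sanity-check the saturation claim, the reported figures can be combined directly; this is only a rough sketch, and the replica core count mentioned in the comment is an assumption, not the actual patroni-main sizing:

    # Back-of-the-envelope CPU demand from the figures reported above.
    requests_per_minute = 4_000        # reported request rate
    cpu_seconds_per_query = 7.5        # midpoint of the reported 7-8 s of CPU per query

    demand = requests_per_minute / 60 * cpu_seconds_per_query
    print(f"CPU demand: ~{demand:.0f} core-seconds of work per second of wall clock")
    # => ~500 fully busy cores' worth of work, assuming no caching or early termination.
    # Even a generously sized replica fleet (for example, a few hundred cores in total --
    # an assumed figure) cannot absorb this, so CPU saturation and queueing follow
    # directly from the request rate and per-query cost.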

Cascading Failure Mechanism

  1. Replica Saturation: The initial query load hit replica nodes, causing CPU saturation and LWLock:BufferMapping contention
  2. Replication Lag: As replicas became CPU-bound, replication lag increased
  3. Load Balancer Failover: Rails database load balancing detected slow replica responses and shifted read traffic to the primary node (illustrated in the sketch after this list)
  4. Primary Saturation: The primary node became overwhelmed with read traffic it wasn't designed to handle
  5. Connection Pool Exhaustion: PgBouncer connection pools saturated, starving other queries and causing widespread timeouts
  6. Platform Impact: All endpoints sharing the saturated connection pool experienced degraded performance
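
The failover step above is the key amplifier: once every replica looks unhealthy, all read traffic has nowhere to go but the primary. The following simplified read-routing sketch illustrates the mechanism only; it is not GitLab's actual load-balancing code, and the lag threshold and host names are assumptions:

    # Simplified read routing: when all replicas exceed the health threshold,
    # the only remaining target is the primary, concentrating load on it.
    from dataclasses import dataclass

    @dataclass
    class Host:
        name: str
        replication_lag_s: float  # seconds behind the primary

    MAX_LAG_S = 60.0  # assumed health threshold, not the production value

    def pick_read_host(replicas: list[Host], primary: Host) -> Host:
        healthy = [r for r in replicas if r.replication_lag_s < MAX_LAG_S]
        # With healthy replicas available, reads stay on them; once all replicas
        # are CPU-bound and lagging, every read lands on the primary.
        return min(healthy, key=lambda r: r.replication_lag_s) if healthy else primary

    replicas = [Host("replica-1", 180.0), Host("replica-2", 240.0)]
    print(pick_read_host(replicas, Host("primary", 0.0)).name)  # -> "primary"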

Contributing Factors

  1. Insufficient Rate Limiting: The current rate limiting configuration for this endpoint was insufficient to prevent a single user from generating enough load to saturate database resources
  2. Unoptimized Query: The query involved was not optimized for organizations with complex group hierarchies and large member counts
  3. Lack of Query-Specific Safeguards: No specific rate limits or query timeouts were in place for this expensive endpoint

Resolution

The first incident self-resolved as the requesting user's API calls completed their pagination cycle and traffic returned to normal levels. The second incident was resolved by blocking the offending user account and working with the customer to disable the scheduled pipeline.

Corrective Actions

Immediate:

  • Implement endpoint-specific rate limiting for the affected APIs
  • Review rate limit configurations for high-volume consumers

Short-term:

  • Optimize the affected query to reduce join complexity for large organizational hierarchies

Long-term:

  • Establish more granular, tiered rate limiting strategies that balance customer needs with platform protection
  • Review and optimize other endpoints that perform complex multi-table joins
  • Enhance monitoring and alerting for LWLock contention patterns
  • Implement circuit breakers for queries that exhibit pathological performance characteristics (see the sketch after this list)
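
As a sketch of the circuit-breaker idea above: this is not an existing GitLab component, and the thresholds, cool-down, and class name are hypothetical values chosen for illustration:

    # Minimal per-endpoint circuit breaker: trip after repeated slow or failed
    # queries, reject further work during a cool-down, then allow a trial request.
    import time

    class QueryCircuitBreaker:
        def __init__(self, failure_threshold=5, cooldown_s=30.0, slow_query_s=5.0):
            self.failure_threshold = failure_threshold  # consecutive bad queries before tripping
            self.cooldown_s = cooldown_s                # how long to reject work once open
            self.slow_query_s = slow_query_s            # duration treated as pathological
            self.failures = 0
            self.opened_at = None

        def allow(self) -> bool:
            if self.opened_at is None:
                return True
            if time.monotonic() - self.opened_at >= self.cooldown_s:
                self.opened_at = None                   # half-open: let one trial through
                self.failures = 0
                return True
            return False

        def record(self, duration_s: float, ok: bool) -> None:
            if ok and duration_s < self.slow_query_s:
                self.failures = 0
                return
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()       # open the breaker

A caller would check allow() before running the expensive query and pass the observed duration and outcome to record() afterwards, so a sustained pathological pattern sheds load instead of saturating the connection pool.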

Lessons Learned

  1. Rate Limiting Must Scale with Query Cost: Rate limits should account for the computational cost of operations, not just request volume. Expensive queries require more conservative limits (a minimal sketch follows this list).
  2. Query Complexity Compounds Under Load: Queries that join many tables can exhibit non-linear performance degradation under concurrent load due to lock contention.
  3. Database Load Balancing Can Amplify Issues: When replicas fail health checks, shifting all traffic to the primary can accelerate rather than mitigate incidents.
  4. Observability Matters: Seeing a problem doesn't mean being able to understand a problem. Observability needs to be designed to drive diagnosis in addition to detection.
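
A minimal illustration of cost-aware limiting, as referenced in lesson 1: the endpoint weights, bucket size, and names below are hypothetical, not proposed production values:

    # Token-bucket limiter where each request consumes tokens proportional to the
    # estimated cost of its endpoint, so expensive queries hit the limit sooner.
    import time

    ENDPOINT_COST = {"project_members": 20, "single_project": 1}  # hypothetical weights

    class CostAwareLimiter:
        def __init__(self, tokens_per_minute=600):
            self.capacity = tokens_per_minute
            self.tokens = float(tokens_per_minute)
            self.last_refill = time.monotonic()

        def allow(self, endpoint: str) -> bool:
            now = time.monotonic()
            # Refill continuously, capped at the bucket capacity.
            self.tokens = min(self.capacity,
                              self.tokens + (now - self.last_refill) * self.capacity / 60.0)
            self.last_refill = now
            cost = ENDPOINT_COST.get(endpoint, 1)
            if self.tokens >= cost:
                self.tokens -= cost
                return True
            return False

With weights like these, a caller hammering the expensive endpoint exhausts its budget roughly twenty times faster than one issuing cheap single-project requests, which is the behavior the tiered strategy is meant to capture.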

Investigation Details

Timeline - 2025-10-07 - INC-4589

Incident Timeline

  • 17:15 - Impact began
  • 17:23 - Incident reported by Cleveland Bledsoe Jr (initially Severity 2)
  • 17:28 - Escalated to Severity 1; status page updated; on-call teams pinged
  • 17:39 - Patroni Main Apdex showing recovery
  • 17:57 - Peak database request volume identified
  • 18:36 - Status page marked as resolved
  • 19:11 - Incident moved to post-incident documentation phase

Timeline - 2025-10-09 - INC-4641

Incident Timeline

  • 17:13 - Impact began
  • 17:14 - Incident reported by Stephanie Jackson (initially Severity 2)
  • 17:16 - Escalated to Severity 1
  • 17:19 - Status page updated; user traffic stopped but database still recovering
  • 17:23 - Offending user account blocked
  • 17:28 - Patroni showing recovery (not yet 100%)
  • 17:31 - Status page updated to "Monitoring" - issue mitigated
  • 17:46 - Root cause confirmed: a newly created scheduled CI pipeline
  • 18:12 - Status page marked as resolved
  • 18:19 - Incident moved to post-incident documentation phase

Investigation Notes

Internal investigation was conducted in https://gitlab.com/gitlab-org/gitlab/-/issues/575006 to ensure the privacy of customers and users.

Follow-ups

INC-4589

INC-4641


Process (this section to be removed once completed)

For the person opening the Incident Review

  • Set the title to Incident Review: (Incident issue name)
  • Assign a Service::* label (most likely matching the one on the incident issue)
  • Set a Severity::* label which matches the incident
  • In the Key Information section, make sure to include a link to the incident issue
  • Find and assign a DRI from the team which owns the service (check their Slack channel or assign the team's manager). The DRI for the incident review is the issue assignee.

For the assigned DRI

  • Fill in the remaining fields in the Key Information section, using the incident issue as a reference. Feel free to ask the EOC or other folks involved if anything is difficult to find.
  • If there are metrics showing Customers Affected or Requests Affected, link those metrics in those fields
  • Create a few short sentences in the Summary section summarizing what happened (TL;DR)
  • Link any corrective actions and describe any other actions or outcomes from the incident
  • Consider the implications for self-managed and Dedicated instances. For example, do any bug fixes need to be backported?
  • Once discussion wraps up in the comments, summarize any takeaways in the details section
  • Review the incident timeline for any sensitive information and ensure the description history is cleaned up if anything needs to be removed.
  • Have this issue reviewed by the Production Engineering Senior Manager or Infrastructure Platforms Director before making it public.
  • Close the review before the due date
  • Go back to the incident channel or page and close out the remaining post-incident tasks

Review Guidelines

This review should be completed by the team which owns the service causing the alert. That team has the most context around what caused the problem and what information will be needed for an effective fix. The EOC or IMOC may create this issue, but unless they are also on the service owning team, they should assign someone from that team as the DRI.

Edited by Steve Abrams