Incident Review: Open file descriptor utilization near capacity on ci-runners main stage
INC-4093: Open file descriptor utilization near capacity on ci-runners main stage
Generated by Jan Provazník on 24 Sep 2025 07:59. All timestamps are local to Etc/UTC
Key Information
Metric | Value |
---|---|
Customers Affected | 29.3K users, 4534 top-level namespaces |
Requests Affected | 16M requests from 217K CI jobs |
Incident Severity | Severity 1 (Critical) |
Impact Start Time | Tue, 23 Sep 2025 00:58:00 UTC |
Impact End Time | Tue, 23 Sep 2025 02:45:10 UTC |
Total Duration | 1 day, 6 hours |
Link to Incident Issue | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/20592 |
Summary
Problem: Open file descriptor usage on CI runners (main stage) exceeded capacity, leading to job processing failures and a backlog of unprocessed jobs.
Impact: Customers across GitLab.com were unable to run CI/CD jobs, as most jobs were not being picked up or completed. This caused widespread disruption to customer pipelines. Job success rates are now improving and HTTP 429 errors have dropped significantly.
Causes: An incorrectly scoped rate limit on the PATCH /jobs/:id/trace endpoint caused runners to exhaust available file descriptors while waiting to send their final job logs.
Response strategy: We rolled back the API change, which reduced open file descriptor utilization and restored job processing. We prepared a revert MR for the root cause of the issue and deployed it to production.
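To make the mechanism concrete, the sketch below is a minimal, hypothetical Go model of the failure mode: many concurrent jobs retrying a rate-limited trace update, each holding a TCP connection (and therefore a file descriptor) open while it backs off. It is not the GitLab Runner implementation; the endpoint stand-in, retry policy, and job count are illustrative assumptions only.

// fd_exhaustion_sketch.go -- hypothetical model of the failure mode described above,
// NOT the GitLab Runner code. Many concurrent "jobs" retry a rate-limited final trace
// update; each keeps one connection (one file descriptor) open while it backs off.
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"os"
	"sync"
	"time"
)

func main() {
	// Stand-in for the rate-limited PATCH /api/v4/jobs/:id/trace endpoint: always answer 429.
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Retry-After", "1")
		w.WriteHeader(http.StatusTooManyRequests)
	}))
	defer srv.Close()

	const jobs = 200 // concurrent jobs stuck trying to send their final logs (illustrative)

	var wg sync.WaitGroup
	for i := 0; i < jobs; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			client := &http.Client{} // each job keeps its own keep-alive connection between retries
			for attempt := 0; attempt < 5; attempt++ {
				req, _ := http.NewRequest(http.MethodPatch, srv.URL, nil)
				resp, err := client.Do(req)
				if err != nil {
					return
				}
				resp.Body.Close()
				if resp.StatusCode != http.StatusTooManyRequests {
					return // trace accepted, job can finish
				}
				time.Sleep(time.Second) // back off; the idle connection (and its FD) stays open
			}
		}()
	}

	// While the workers are parked in their retry loops, count this process's open file
	// descriptors (Linux: one directory entry per FD in /proc/self/fd).
	time.Sleep(2 * time.Second)
	if entries, err := os.ReadDir("/proc/self/fd"); err == nil {
		fmt.Printf("open file descriptors while %d jobs are rate-limited: %d\n", jobs, len(entries))
	}
	wg.Wait()
}

With a lowered soft limit (for example, ulimit -n 256 before running), even this small sketch should start hitting "too many open files", the same error the runner managers logged at much larger scale.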
What went well?
Use this section to highlight what went well during the incident. Capturing this helps understand informal processes and expertise, and enables undocumented knowledge to be shared.
- We quickly discovered a recently changed feature flag through the event log which enabled fast mitigation of the impact, as well as pulling in the engineer involved to further diagnose.
- We escalated through dev escalations, which brought in Person X. They knew that Person Y had expertise with the component in question, which enabled faster diagnosis.
What was difficult?
Use this section to highlight opportunities for improvement discovered during the incident. Capturing this helps understand informal processes and expertise, and enables undocumented knowledge to be shared. If the improvement seems like a simple change, consider adding it as a corrective action above instead. Think about how to improve response next time, and consider any patterns pointing to broader issues, like “key person risk.”
- The runbooks/playbooks for this service are out of date and did not contain the information necessary to troubleshoot the incident.
- The incident happened at a time when nobody with expertise on the service was available.
Investigation Details
Timeline
Incident Timeline
2025-09-23
00:59:00 Impact started at
Custom timestamp "Impact started at" occurred
00:59:13 Incident reported in triage by Prometheus Alertmanager alert
Prometheus Alertmanager alert reported the incident
Severity: None
Status: Triage
01:03:28 Image posted by Tarun Khandelwal
Tarun Khandelwal posted an image to the channel
It seems the open_fds on all of the CI runners have shot up to 100% after the recent deployment of: 18.5.202509221806-1eb7c144408.7514e4448ba
source (https://dashboards.gitlab.net/goto/pLUQQ9CHg?orgId=1)
01:04:18 Incident accepted
Tarun Khandelwal shared an update
Severity: None → Severity 3 (Medium)
Status: Triage → Investigating
01:08:24 Message from Siddharth Kannan
Siddharth Kannan's message was pinned by Tarun Khandelwal
429 could be related to this commit: https://gitlab.com/gitlab-org/security/gitlab/-/commit/018424ae9b3139ba73baf744ca065d487a723cd5 This was released as part of the deployment.
Unclear if it could be related to the open FDs incident though.
01:10:55 Message from Siddharth Kannan
Siddharth Kannan's message was pinned by Tarun Khandelwal
The number of open FDs increasing seems like a timeout that was increased from the previous value, or a new timeout which was introduced causing requests to take longer?
Could it be related to this: gitlab-org/gitlab@dd9f088b (MR: gitlab-org/gitlab!204265 (merged))
01:13:33 Image posted by Anton Starovoytov
Anton Starovoytov posted an image to the channel
"too many open files" errors increase starting from 00:55 UTC on the runners logs. Most likely a deployment issue
https://log.gprd.gitlab.net/app/r/s/axJqz
01:14:44 Message from Kent Ballon
Kent Ballon's message was pinned by Tarun Khandelwal
It feels like a lot of GitLab.com runners are getting stuck and not starting properly.
01:20:40 Severity upgraded from Severity 3 (Medium) → Severity 2 (High)
Tarun Khandelwal shared an update
Severity: Severity 3 (Medium) → Severity 2 (High)
01:21:42 Image posted by Anton Starovoytov
Anton Starovoytov posted an image to the channel
The number of succeeded jobs dropped significantly, so customers are affected:
source (https://log.gprd.gitlab.net/app/r/s/exbZQ)
01:22:11 Image posted by Zoe Braddock
Zoe Braddock posted an image to the channel
So I do think this shows that the most popular shard saas-linux-small-amd64 is down, and I recommend we increase the severity of the incident and roll back the change.
https://dashboards.gitlab.net/goto/Owmrl9jHg?orgId=1
01:22:52 Message from Thiago Figueiro
Thiago Figueiro pinned their own message
<!subteam^S069XU8KYGY> thread
01:23:08 Message from Zoe Braddock
Zoe Braddock pinned their own message
@siddharthkannan is rolling back the change
01:38:35 Message from Siddharth Kannan
Siddharth Kannan's message was pinned by Thiago Figueiro
Initiating Rollback
Runbook: https://gitlab.com/gitlab-org/release/docs/-/blob/master/runbooks/rollback-a-deployment.md
01:39:42 Image posted by Zoe Braddock
Zoe Braddock posted an image to the channel
Have a look at the large increase in 429s
https://dashboards.gitlab.net/goto/qLm9urCHR?orgId=1
01:40:03 Image posted by Anton Starovoytov
Anton Starovoytov posted an image to the channel
429 on the Runners: https://log.gprd.gitlab.net/app/r/s/6yQw7
01:43:36 Message from Tarun Khandelwal
Tarun Khandelwal pinned their own message
gitlab-org/gitlab!204751 (merged)
01:47:51 Message from Siddharth Kannan
Siddharth Kannan's message was pinned by Tarun Khandelwal
We have agreed on the incident bridge that we will be waiting for AppSec approval that rolling back the S1 SIRT fix is OK before proceeding with the rollback.
01:49:39 Message from Siddharth Kannan
Siddharth Kannan pinned their own message
INTERNAL NOTE ONLY: The diff which will be rolled back contains the fix MR for S1: https://gitlab.com/gitlab-org/security/gitlab/-/merge_requests/5347
$ g log 81b2b55cdd4a123a8a49f77ab3202e8cbd649931...1eb7c144408f32a8b4cba043f51662c28316579d | rg 1eb7c144408f32a8b4cba043f51662c28316579d
commit 1eb7c144408f32a8b4cba043f51662c28316579d
01:51:25 Update shared
Tarun Khandelwal (via @incident) shared an update
CI runners on the main stage became saturated with open file descriptors, causing jobs on the most popular shard to stop processing. As a result, CI/CD jobs are not being picked up or completed for many customers on GitLab.com. Customers are experiencing widespread failures, with a sharp drop in successfully executed jobs and an increase in HTTP 429 errors.
The incident has been traced to recent changes to the PUT /jobs/:id API endpoint, which increased request volume and overloaded Redis, leading to file descriptor exhaustion. An API rate limit deployed to mitigate a separate issue has also contributed to the current impact. Both deployments are under review for rollback.
We are preparing a rollback to revert the identified merge requests (MR !204265, MR !204751).
02:03:56 Message from Katherine Wu
Katherine Wu's message was pinned by Siddharth Kannan
Ok please proceed with the rollback
02:05:00 Message from Siddharth Kannan
Siddharth Kannan pinned their own message
Thank you! Starting the rollback pipeline
02:05:19 Message from Zoe Braddock
Zoe Braddock's message was pinned by Tarun Khandelwal
We have permission from the security team to proceed with the rollback - given verbally in the call.
02:06:30 Severity upgraded from Severity 2 (High) → Severity 1 (Critical)
Thiago Figueiro shared an update
Severity: Severity 2 (High) → Severity 1 (Critical)
02:08:42 Message from Siddharth Kannan
Siddharth Kannan pinned their own message
Rollback pipeline started: https://gitlab.slack.com/archives/C0139MAV672/p1758593279077899
/chatops run deploy 18.5.202509220906-81b2b55cdd4.09ed5644993 gprd --rollback
https://ops.gitlab.net/gitlab-com/gl-infra/deployer/-/pipelines/5003945
02:09:07 Message from Thiago Figueiro
Thiago Figueiro pinned their own message
I have bumped this to S1 because of the high impact to customers. Almost no CI Jobs are running, and it’s also causing a spike on Support.
02:27:52 Message from Hordur Yngvason
Hordur Yngvason's message was pinned by Thiago Figueiro
gitlab-org/gitlab!204709 (merged) probably needs to be reverted before gitlab-org/gitlab!204265 (merged)
02:33:09 Message from Zoe Braddock
Zoe Braddock pinned their own message
Here are the key metrics we are watching as this incident recovers. cc @hordur
- Successfully executed jobs: https://log.gprd.gitlab.net/app/r/s/WGIqV
- Job Failures: https://log.gprd.gitlab.net/app/r/s/SrZNr
- 429s: https://dashboards.gitlab.net/goto/qLm9urCHR?orgId=1
- Open File Descriptors: https://dashboards.gitlab.net/goto/pplqq9jHR?orgId=1
02:35:48 Message from Zoe Braddock
Zoe Braddock's message was pinned by Tarun Khandelwal
Here is the pipeline we are watching to see the progression of the rollback: https://ops.gitlab.net/gitlab-com/gl-infra/k8s-workloads/gitlab-com/-/pipelines/5003960
We are not expecting to see significant changes in the metrics until the rollback has been deployed to at least one zonal Kubernetes cluster.
02:44:28 Image posted by Zoe Braddock
Zoe Braddock posted an image to the channel
We are starting to see very early signs of a decrease in open_fds component saturation on the runner managers (that's a good thing).
https://dashboards.gitlab.net/goto/7Kybe9CHR?orgId=1
02:45:03 Message from Siddharth Kannan
Siddharth Kannan's message was pinned by Tarun Khandelwal
@hordur This is the MR which reverts all three of those MRs, is someone from Verify available to take a look at it?
gitlab-org/gitlab!205960 (merged)
I had to fix one merge conflict though. I put the detail for the conflict resolution in the MR description, if that helps.
02:52:05 Update shared
Thiago Figueiro (via @incident) shared an update
We've deployed the rollback to address the file descriptor saturation on CI runners, and early signs of recovery are visible.
Open file descriptor utilization on runner managers is decreasing, and the number of succeeded jobs is rising. The rate of HTTP 429 errors has dropped significantly in the last several minutes, indicating the system is stabilizing.
We're continuing to monitor metrics and will confirm full recovery once the backlog of jobs is cleared and job success rates return to normal.
02:53:18 Image posted by Zoe Braddock
Zoe Braddock posted an image to the channel
The 429s are gone
https://log.gprd.gitlab.net/app/r/s/7emKT
02:56:43 Message from Zoe Braddock
Zoe Braddock's message was pinned by Tarun Khandelwal
For the retro - we realized very early on in this incident that the problem was likely linked to these two MRs - gitlab-org/gitlab!204751 (merged) and gitlab-org/gitlab!204265 (merged). We focused on rolling back these changes; however, it is possible that we might have been able to resolve this incident faster if we had logged into the GitLab.com application and manually increased these rate limits to a very high number. This is not a criticism of the responders - just an interesting observation for next time.
03:01:48 Status changed from Investigating → Fixing
Thiago Figueiro shared an update
Status: Investigating → Fixing
03:01:48 Identified at
Custom timestamp "Identified at" occurred
03:02:11 Status changed from Fixing → Monitoring
Thiago Figueiro shared an update
Status: Fixing → Monitoring
03:02:11 Fixed at
Custom timestamp "Fixed at" occurred
03:02:11 Monitoring at
Custom timestamp "Monitoring at" occurred
03:06:11 Message from Zoe Braddock
Zoe Braddock pinned their own message
@anton Starovoytov pointed out that it's just as likely we might have found that the problem was with the relevant code paths rather than the specific values, or we might have needed to wait for the same SIRT approval, and rollback was likely required anyway.
I guess it is impossible to time travel back anyway
03:10:00 Message from Siddharth Kannan
Siddharth Kannan's message was pinned by Tarun Khandelwal
Thank you for the reviews! I have set the revert MR gitlab-org/gitlab!205960 (merged) to Merge when pipeline succeeds
03:26:03 Message from Chris Stone
Chris Stone's message was pinned by Thiago Figueiro
Hi folks. As an aside and not to detract focus from the immediate problem, I was wondering if we'll see an issue with compute minutes being consumed due to this incident. I noticed a job of mine still running for over 2 hours with a 1 hour timeout. I recall we had a similar issue before, just wanted to confirm if we will likely see the same resulting in refund requests...
Support response issue (https://gitlab.com/gitlab-com/support/support-team-meta/-/issues/7191#note_2770040310)
04:50:32 Message from Thiago Figueiro
Thiago Figueiro pinned their own message
@jprovaznik is the next IMOC.
Jan, this incident is now resolved. There are follow-ups:
- A revert MR (gitlab-org/gitlab!205960 (merged)) needs to be deployed because it contains an unrelated S1 SIRT fix. This is in progress.
- We need to investigate whether CI minutes were consumed https://gitlab.com/gitlab-com/gl-infra/production/-/issues/20592#note_2770090148 and, if so, start a credit for affected users.
- A single customer has reported that their self-hosted runner still can’t run jobs. Support is following this up in gitlab-org/gitlab!205960 (comment 2770147070).
04:52:08 Incident resolved and entered the post-incident flow
Thiago Figueiro shared an update
Status: Monitoring → Documenting
06:29:19 Message from Siddharth Kannan
Siddharth Kannan's message was pinned by Jan Provazník
Re: "A revert MR (gitlab-org/gitlab!205960 (merged)) needs to be deployed because it contains an unrelated S1 SIRT fix. This is in progress." - the revert MR does not contain a SIRT fix itself.
It is a revert of the three MRs which caused this incident. The description of gitlab-org/gitlab!205960 (merged) links the three MRs that we believe caused this issue.
We need to deploy gitlab-org/gitlab!205960 (merged) because we need to deploy a separate MR, which is the SIRT fix.
07:05:25 Message from Jan Provazník
Jan Provazník pinned their own message
Re-opening the incident until the revert MR is merged and deployed - I think it's more accurate than the current resolved status.
07:06:06 Incident re-opened
Jan Provazník shared an update
Status: Documenting → Monitoring
07:06:06 Documented at
Custom timestamp "Documented at" occurred
07:08:30 Update shared
Jan Provazník shared an update
The incident is mitigated on .com because we rolled back the previous deployment. We are waiting for gitlab-org/gitlab!205960 (merged) (which reverts the root cause of the issue) to be merged and deployed to .com.
07:41:34 Update shared
Jan Provazník (via @incident) shared an update
The incident has been mitigated on GitLab.com by rolling back the previous deployment, which addressed the open file descriptor saturation and restored job processing for CI runners.
We are now monitoring the system while we wait for the revert merge request !205960, which undoes the root cause changes, to be merged and deployed to production. No further customer impact is expected, but we will confirm full resolution once the revert is live and metrics remain stable.
2h later
09:55:38 Update shared
Jan Provazník (via @incident) shared an update
The revert merge request !205960, which undoes the root cause of the CI runners file descriptor saturation, has been cherry-picked into the latest auto-deploy branch and is scheduled to begin deployment within the next hour.
We continue to monitor system health and job processing on GitLab.com. No new customer impact has been reported since the initial mitigation. We will confirm full resolution after the revert is deployed and all key metrics remain stable.
7h later
17:06:20 Update shared
Terri Chu (via @incident) shared an update
The revert merge request !205960 that addresses the CI runners file descriptor saturation was deployed to gprd-cny, but this environment was drained due to a separate incident affecting Terraform state endpoints. Deployment to production is pending until that issue is resolved and gprd-cny is enabled again.
We continue to monitor for any new customer impact. Once the revert is promoted to production and metrics confirm stability, we will confirm full resolution.
2025-09-24
04:09:28 Message from Thiago Figueiro
Thiago Figueiro pinned their own message
This incident was the root cause of the S3 incident #inc-4132-merge-request-home-page-is-empty
The rate-limiting that caused this incident also prevented a cache file from being uploaded for a GitLab image build (https://dev.gitlab.org/gitlab/gitlab-ee/-/jobs/30827983), so we built and deployed an image with broken assets.
Stan created a fix to prevent the same problem on the image creation side gitlab-org/gitlab!206131 (merged)
04:10:38 Message from Siddharth Kannan
Siddharth Kannan's message was pinned by Jan Provazník
A package with the revert MR is being deployed to gprd now:
https://gitlab.slack.com/archives/C8PKBH3M5/p1758685489162579
05:37:03 Message from Siddharth Kannan
Siddharth Kannan's message was pinned by Jan Provazník
The revert MR was released to gprd as part of this package (https://gitlab.slack.com/archives/C8PKBH3M5/p1758690483349359?thread_ts=1758685489.162579&cid=C8PKBH3M5) at 2025-09-24 05:08 UTC.
07:11:30 Incident resolved and entered the post-incident flow
Jan Provazník shared an update
Status: Monitoring → Documenting
The revert MR has been deployed to production and all metrics from https://gitlab.com/gitlab-com/gl-infra/production/-/issues/20592#note_2770031211 are OK. Getting this revert MR deployed took a bit longer because of unrelated incidents that delayed the deployment to production.
Investigation Notes
Timing-Dependent Vulnerability: The incident occurred during UTC night hours; although job load was lower, idle instance pools were also scaled down, leaving less buffer capacity to absorb the impact and fewer idle instances available for recovery.
Cascading Failure Pattern: The incident followed a specific cascade:
- Rate limiting causes trace update failures
- File descriptors accumulate from multiple sources: retry mechanisms, elevated memory/CPU/IOPS usage, and jobs stuck in various retry loops
- Docker Machine operations start failing due to FD exhaustion
- Autoscaling fails, creating no_free_executor conditions
- The system enters a death spiral as Docker Machine keeps trying to create/remove instances
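The open_fds saturation that drove this cascade can also be checked directly on a host, without the dashboards. The sketch below is a small, hypothetical Go helper (not an existing GitLab or Runner tool) that compares a process's open descriptors against its soft RLIMIT_NOFILE on Linux; the paths and the warning threshold are illustrative assumptions.

// fd_saturation_check.go -- hypothetical helper, not part of GitLab Runner. Reports how
// close the current process is to its file descriptor limit on Linux. To inspect another
// process (such as a runner manager), read /proc/<pid>/fd and /proc/<pid>/limits instead.
package main

import (
	"fmt"
	"os"
	"syscall"
)

func main() {
	// One directory entry per open descriptor of this process.
	fds, err := os.ReadDir("/proc/self/fd")
	if err != nil {
		fmt.Fprintln(os.Stderr, "reading /proc/self/fd:", err)
		return
	}

	// Soft limit on open files, which is what "too many open files" is measured against.
	var limit syscall.Rlimit
	if err := syscall.Getrlimit(syscall.RLIMIT_NOFILE, &limit); err != nil {
		fmt.Fprintln(os.Stderr, "getrlimit:", err)
		return
	}

	saturation := float64(len(fds)) / float64(limit.Cur)
	fmt.Printf("open fds: %d of %d (%.0f%% of soft limit)\n", len(fds), limit.Cur, saturation*100)
	if saturation > 0.9 { // illustrative threshold; during the incident the dashboards showed near 100%
		fmt.Println("WARNING: file descriptor saturation above 90%")
	}
}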
Follow-ups
Follow-up | Owner |
---|---|
2025-09-23: Open file descriptor utilization near capacity on ci-runners main stage | Unassigned |
Assess incident impact on customer compute minute usage and refunds | Nicole Williams |
Interesting to know that despite the very large decrease in successfully... | Zoe Braddock |
Review Guidelines
This review should be completed by the team which owns the service causing the alert. That team has the most context around what caused the problem and what information will be needed for an effective fix. The EOC or IMOC may create this issue, but unless they are also on the service owning team, they should assign someone from that team as the DRI.
For the person opening the Incident Review
- Set the title to Incident Review: (Incident issue name)
- Assign a Service::* label (most likely matching the one on the incident issue)
- Set a Severity::* label which matches the incident
- In the Key Information section, make sure to include a link to the incident issue
- Find and assign a DRI from the team which owns the service (check their Slack channel or assign the team's manager). The DRI for the incident review is the issue assignee.
For the assigned DRI
- Fill in the remaining fields in the Key Information section, using the incident issue as a reference. Feel free to ask the EOC or other folks involved if anything is difficult to find.
- If there are metrics showing Customers Affected or Requests Affected, link those metrics in those fields
- Create a few short sentences in the Summary section summarizing what happened (TL;DR)
- Link any corrective actions and describe any other actions or outcomes from the incident
- Consider the implications for self-managed and Dedicated instances. For example, do any bug fixes need to be backported?
- Once discussion wraps up in the comments, summarize any takeaways in the details section
- If the incident timeline does not contain any sensitive information and this review can be made public, turn off the issue's confidential mode and link this review to the incident issue
- Close the review before the due date
- Go back to the incident channel or page and close out the remaining post-incident tasks