Incident Review: Open file descriptor utilization near capacity on ci-runners main stage
INC-4093: Open file descriptor utilization near capacity on ci-runners main stage
Generated by Jan Provazník on 24 Sep 2025 07:59. All timestamps are local to Etc/UTC
Key Information
Metric | Value |
---|---|
Customers Affected | 29.3K users, 4534 top-level namespaces |
Requests Affected | 16M requests from 217K CI jobs |
Incident Severity | Severity 1 (Critical) |
Impact Start Time | Tue, 23 Sep 2025 00:58:00 UTC |
Impact End Time | Tue, 23 Sep 2025 02:45:10 UTC |
Total Duration | 1 day, 6 hours |
Link to Incident Issue | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/20592 |
Summary
Problem: Open file descriptor usage on CI runners (main stage) exceeded capacity, leading to job processing failures and a backlog of unprocessed jobs.
Impact: Customers across GitLab.com were unable to run CI/CD jobs, as most jobs were not being picked up or completed. This caused widespread disruption to customer pipelines. Job success rates are now improving and HTTP 429 errors have dropped significantly.
Causes: An incorrectly scoped rate limit on the PATCH /jobs/:id/trace endpoint caused runners to exhaust available file descriptors while waiting to send their final job logs.
Response strategy: We rolled back the API change, which reduced open file descriptor utilization and restored job processing. We prepared a revert MR for the root cause of the issue and deployed it to production.
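To make the mechanism concrete, the sketch below is a minimal, hypothetical Go model of the failure mode: many concurrent jobs retrying a rate-limited trace update, each holding a TCP connection (and therefore a file descriptor) open while it backs off. It is not the GitLab Runner implementation; the endpoint stand-in, retry policy, and job count are illustrative assumptions only.

// fd_exhaustion_sketch.go -- hypothetical model of the failure mode described above,
// NOT the GitLab Runner code. Many concurrent "jobs" retry a rate-limited final trace
// update; each keeps one connection (one file descriptor) open while it backs off.
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"os"
	"sync"
	"time"
)

func main() {
	// Stand-in for the rate-limited PATCH /api/v4/jobs/:id/trace endpoint: always answer 429.
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Retry-After", "1")
		w.WriteHeader(http.StatusTooManyRequests)
	}))
	defer srv.Close()

	const jobs = 200 // concurrent jobs stuck trying to send their final logs (illustrative)

	var wg sync.WaitGroup
	for i := 0; i < jobs; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			client := &http.Client{} // each job keeps its own keep-alive connection between retries
			for attempt := 0; attempt < 5; attempt++ {
				req, _ := http.NewRequest(http.MethodPatch, srv.URL, nil)
				resp, err := client.Do(req)
				if err != nil {
					return
				}
				resp.Body.Close()
				if resp.StatusCode != http.StatusTooManyRequests {
					return // trace accepted, job can finish
				}
				time.Sleep(time.Second) // back off; the idle connection (and its FD) stays open
			}
		}()
	}

	// While the workers are parked in their retry loops, count this process's open file
	// descriptors (Linux: one directory entry per FD in /proc/self/fd).
	time.Sleep(2 * time.Second)
	if entries, err := os.ReadDir("/proc/self/fd"); err == nil {
		fmt.Printf("open file descriptors while %d jobs are rate-limited: %d\n", jobs, len(entries))
	}
	wg.Wait()
}

With a lowered soft limit (for example, ulimit -n 256 before running), even this small sketch should start hitting "too many open files", the same error the runner managers logged at much larger scale.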
What went well?
Use this section to highlight what went well during the incident. Capturing this helps understand informal processes and expertise, and enables undocumented knowledge to be shared.
- We quickly discovered a recently changed feature flag through the event log which enabled fast mitigation of the impact, as well as pulling in the engineer involved to further diagnose.
- We escalated through dev escalations, which brought in Person X. They knew that Person Y had expertise with the component in question, which enabled faster diagnosis.
What was difficult?
Use this section to highlight opportunities for improvement discovered during the incident. Capturing this helps understand informal processes and expertise, and enables undocumented knowledge to be shared. If the improvement seems like a simple change, consider adding it as a corrective action above instead. Think about how to improve response next time, and consider any patterns pointing to broader issues, like “key person risk.”
- The runbooks/playbooks for this service are out of date and did not contain the information necessary to troubleshoot the incident.
- The incident happened at a time when nobody with expertise on the service was available.
Investigation Details
Timeline
Incident Timeline
2025-09-23
00:59:00 Impact started at
Custom timestamp "Impact started at" occurred
00:59:13 Incident reported in triage by Prometheus Alertmanager alert
Prometheus Alertmanager alert reported the incident
Severity: None
Status: Triage
01:03:28 Image posted by Tarun Khandelwal
Tarun Khandelwal posted an image to the channel
It seems the open_fds on all of the CI runners have shot up to 100% after the recent deployment of: 18.5.202509221806-1eb7c144408.7514e4448ba
source (https://dashboards.gitlab.net/goto/pLUQQ9CHg?orgId=1)
01:04:18 Incident accepted
Tarun Khandelwal shared an update
Severity: None → Severity 3 (Medium)
Status: Triage → Investigating
01:08:24 Message from Siddharth Kannan
Siddharth Kannan's message was pinned by Tarun Khandelwal
429 could be related to this commit: https://gitlab.com/gitlab-org/security/gitlab/-/commit/018424ae9b3139ba73baf744ca065d487a723cd5 This was released as part of the deployment.
Unclear if it could be related to the open FDs incident though.
01:10:55 Message from Siddharth Kannan
Siddharth Kannan's message was pinned by Tarun Khandelwal
The number of open FDs increasing seems like a timeout that was increased from the previous value, or a new timeout which was introduced causing requests to take longer?
Could it be related to this: gitlab-org/gitlab@dd9f088b (MR: gitlab-org/gitlab!204265 (merged))
01:13:33 Image posted by Anton Starovoytov
Anton Starovoytov posted an image to the channel
"too many open files" errors increase starting from 00:55 UTC on the runners logs. Most likely a deployment issue
https://log.gprd.gitlab.net/app/r/s/axJqz
01:14:44 Message from Kent Ballon
Kent Ballon's message was pinned by Tarun Khandelwal
It feels like a lot of GitLab.com runners are getting stuck and not starting properly.
01:20:40 Severity upgraded from Severity 3 (Medium) → Severity 2 (High)
Tarun Khandelwal shared an update
Severity: Severity 3 (Medium) → Severity 2 (High)
01:21:42 Image posted by Anton Starovoytov
Anton Starovoytov posted an image to the channel
The number of succeeded jobs dropped significantly, so customers are affected:
source (https://log.gprd.gitlab.net/app/r/s/exbZQ)
01:22:11 Image posted by Zoe Braddock
Zoe Braddock posted an image to the channel
So I do think this shows that the most popular shard saas-linux-small-amd64 is down, and I recommend we increase the severity of the incident and roll back the change.
https://dashboards.gitlab.net/goto/Owmrl9jHg?orgId=1
01:22:52 Message from Thiago Figueiro
Thiago Figueiro pinned their own message
<!subteam^S069XU8KYGY> thread
01:23:08 Message from Zoe Braddock
Zoe Braddock pinned their own message
@siddharthkannan is rolling back the change
01:38:35 Message from Siddharth Kannan
Siddharth Kannan's message was pinned by Thiago Figueiro
Initiating Rollback
Runbook: https://gitlab.com/gitlab-org/release/docs/-/blob/master/runbooks/rollback-a-deployment.md
01:39:42 Image posted by Zoe Braddock
Zoe Braddock posted an image to the channel
Have a look at the large increase in 429s
https://dashboards.gitlab.net/goto/qLm9urCHR?orgId=1
01:40:03 Image posted by Anton Starovoytov
Anton Starovoytov posted an image to the channel
429 on the Runners: https://log.gprd.gitlab.net/app/r/s/6yQw7
01:43:36 Message from Tarun Khandelwal
Tarun Khandelwal pinned their own message
gitlab-org/gitlab!204751 (merged)
01:47:51 Message from Siddharth Kannan
Siddharth Kannan's message was pinned by Tarun Khandelwal
We have agreed on the incident bridge that we will be waiting for AppSec approval that rolling back the S1 SIRT fix is OK before proceeding with the rollback.
01:49:39 Message from Siddharth Kannan
Siddharth Kannan pinned their own message
INTERNAL NOTE ONLY: The diff which will be rolled back contains the fix MR for S1: https://gitlab.com/gitlab-org/security/gitlab/-/merge_requests/5347
$ g log 81b2b55cdd4a123a8a49f77ab3202e8cbd649931...1eb7c144408f32a8b4cba043f51662c28316579d | rg 1eb7c144408f32a8b4cba043f51662c28316579d
commit 1eb7c144408f32a8b4cba043f51662c28316579d
01:51:25 Update shared
Tarun Khandelwal (via @incident) shared an update
CI runners on the main stage became saturated with open file descriptors, causing jobs on the most popular shard to stop processing. As a result, CI/CD jobs are not being picked up or completed for many customers on GitLab.com. Customers are experiencing widespread failures, with a sharp drop in successfully executed jobs and an increase in HTTP 429 errors.
The incident has been traced to recent changes to the PUT /jobs/:id API endpoint, which increased request volume and overloaded Redis, leading to file descriptor exhaustion. An API rate limit deployed to mitigate a separate issue has also contributed to the current impact. Both deployments are under review for rollback.
We are preparing a rollback to revert the identified merge requests (MR !204265, MR !204751).
02:03:56 Message from Katherine Wu
Katherine Wu's message was pinned by Siddharth Kannan
Ok please proceed with the rollback
02:05:00 Message from Siddharth Kannan
Siddharth Kannan pinned their own message
Thank you! Starting the rollback pipeline
02:05:19 Message from Zoe Braddock
Zoe Braddock's message was pinned by Tarun Khandelwal
We have permission from the security team to proceed with the rollback - given verbally in the call.
02:06:30 Severity upgraded from Severity 2 (High) → Severity 1 (Critical)
Thiago Figueiro shared an update
Severity: Severity 2 (High) → Severity 1 (Critical)
02:08:42 Message from Siddharth Kannan
Siddharth Kannan pinned their own message
Rollback pipeline started: https://gitlab.slack.com/archives/C0139MAV672/p1758593279077899
/chatops run deploy 18.5.202509220906-81b2b55cdd4.09ed5644993 gprd --rollback
https://ops.gitlab.net/gitlab-com/gl-infra/deployer/-/pipelines/5003945
02:09:07 Message from Thiago Figueiro
Thiago Figueiro pinned their own message
I have bumped this to S1 because of the high impact to customers. Almost no CI Jobs are running, and it’s also causing a spike on Support.
02:27:52 Message from Hordur Yngvason
Hordur Yngvason's message was pinned by Thiago Figueiro
gitlab-org/gitlab!204709 (merged) probably needs to be reverted before gitlab-org/gitlab!204265 (merged)
02:33:09 Message from Zoe Braddock
Zoe Braddock pinned their own message
Here are the key metrics we are watching as this incident recovers. cc @hordur
- Successfully executed jobs: https://log.gprd.gitlab.net/app/r/s/WGIqV
- Job Failures: https://log.gprd.gitlab.net/app/r/s/SrZNr
- 429s: https://dashboards.gitlab.net/goto/qLm9urCHR?orgId=1
- Open File Descriptors: https://dashboards.gitlab.net/goto/pplqq9jHR?orgId=1
02:35:48 Message from Zoe Braddock
Zoe Braddock's message was pinned by Tarun Khandelwal
Here is the pipeline we are watching to see the progression of the rollback: https://ops.gitlab.net/gitlab-com/gl-infra/k8s-workloads/gitlab-com/-/pipelines/5003960
We are not expecting to see significant changes in the metrics until the rollback has been deployed to at least one zonal Kubernetes cluster.
02:44:28 Image posted by Zoe Braddock
Zoe Braddock posted an image to the channel
We are starting to see very early signs of a decrease in open_fds component saturation on the runner managers (that's a good thing).
https://dashboards.gitlab.net/goto/7Kybe9CHR?orgId=1
02:45:03 Message from Siddharth Kannan
Siddharth Kannan's message was pinned by Tarun Khandelwal
@hordur This is the MR which reverts all three of those MRs, is someone from Verify available to take a look at it?
gitlab-org/gitlab!205960 (merged)
I had to fix one merge conflict though. I put the detail for the conflict resolution in the MR description, if that helps.
02:52:05 Update shared
Thiago Figueiro (via @incident) shared an update
We've deployed the rollback to address the file descriptor saturation on CI runners, and early signs of recovery are visible.
Open file descriptor utilization on runner managers is decreasing, and the number of succeeded jobs is rising. The rate of HTTP 429 errors has dropped significantly in the last several minutes, indicating the system is stabilizing.
We're continuing to monitor metrics and will confirm full recovery once the backlog of jobs is cleared and job success rates return to normal.
02:53:18 Image posted by Zoe Braddock
Zoe Braddock posted an image to the channel
The 429s are gone
https://log.gprd.gitlab.net/app/r/s/7emKT
02:56:43 Message from Zoe Braddock
Zoe Braddock's message was pinned by Tarun Khandelwal
For the retro - we realized very early on in this incident that the problem was likely linked to these two MRs - gitlab-org/gitlab!204751 (merged) and gitlab-org/gitlab!204265 (merged). We focused on rolling back these changes; however, it is possible that we might have been able to resolve this incident faster if we had logged into the GitLab.com application and manually increased these rate limits to a very high number. This is not a criticism of the responders - just an interesting observation for next time.
03:01:48 Status changed from Investigating → Fixing
Thiago Figueiro shared an update
Status: Investigating → Fixing
03:01:48 Identified at
Custom timestamp "Identified at" occurred
03:02:11 Status changed from Fixing → Monitoring
Thiago Figueiro shared an update
Status: Fixing → Monitoring
03:02:11 Fixed at
Custom timestamp "Fixed at" occurred
03:02:11 Monitoring at
Custom timestamp "Monitoring at" occurred
03:06:11 Message from Zoe Braddock
Zoe Braddock pinned their own message
@anton Starovoytov pointed out that it's just as likely we might have found that the problem was with the relevant code paths rather than the specific values, or we might have needed to wait for the same SIRT approval, and rollback was likely required anyway.
I guess it is impossible to time travel back anyway
03:10:00 Message from Siddharth Kannan
Siddharth Kannan's message was pinned by Tarun Khandelwal
Thank you for the reviews! I have set the revert MR gitlab-org/gitlab!205960 (merged) to Merge when pipeline succeeds
03:26:03 Message from Chris Stone
Chris Stone's message was pinned by Thiago Figueiro
Hi folks. As an aside and not to detract focus from the immediate problem, I was wondering if we'll see an issue with compute minutes being consumed due to this incident. I noticed a job of mine still running for over 2 hours with a 1 hour timeout. I recall we had a similar issue before, just wanted to confirm if we will likely see the same resulting in refund requests...
Support response issue (https://gitlab.com/gitlab-com/support/support-team-meta/-/issues/7191#note_2770040310)
04:50:32 Message from Thiago Figueiro
Thiago Figueiro pinned their own message
@jprovaznik is the next IMOC.
Jan, this incident is now resolved. There are follow-ups:
- A revert MR (gitlab-org/gitlab!205960 (merged)) needs to be deployed because it contains an unrelated S1 SIRT fix. This is in progress.
- We need to investigate whether CI minutes were consumed https://gitlab.com/gitlab-com/gl-infra/production/-/issues/20592#note_2770090148 and, if so, start a credit for affected users.
- A single customer has reported that their self-hosted runner still can’t run jobs. Support is following this up in gitlab-org/gitlab!205960 (comment 2770147070).
04:52:08 Incident resolved and entered the post-incident flow
Thiago Figueiro shared an update
Status: Monitoring → Documenting
06:29:19 Message from Siddharth Kannan
Siddharth Kannan's message was pinned by Jan Provazník
Re: "A revert MR (gitlab-org/gitlab!205960 (merged)) needs to be deployed because it contains an unrelated S1 SIRT fix. This is in progress." - the revert MR does not contain a SIRT fix itself.
It is a revert of the three MRs which caused this incident. The description of gitlab-org/gitlab!205960 (merged) links the three MRs that we believe caused this issue.
We need to deploy gitlab-org/gitlab!205960 (merged) because we need to deploy a separate MR, which is the SIRT fix.
07:05:25 Message from Jan Provazník
Jan Provazník pinned their own message
Re-opening the incident until the revert MR is merged and deployed - I think it's more accurate than the current resolved status.
07:06:06 Incident re-opened
Jan Provazník shared an update
Status: Documenting → Monitoring
07:06:06 Documented at
Custom timestamp "Documented at" occurred
07:08:30 Update shared
Jan Provazník shared an update
The incident is mitigated on .com because we rolled back the previous deployment. We are waiting for gitlab-org/gitlab!205960 (merged) (which reverts the root cause of the issue) to be merged and deployed to .com.
07:41:34 Update shared
Jan Provazník (via @incident) shared an update
The incident has been mitigated on GitLab.com by rolling back the previous deployment, which addressed the open file descriptor saturation and restored job processing for CI runners.
We are now monitoring the system while we wait for the revert merge request !205960, which undoes the root cause changes, to be merged and deployed to production. No further customer impact is expected, but we will confirm full resolution once the revert is live and metrics remain stable.
2h later
09:55:38 Update shared
Jan Provazník (via @incident) shared an update
The revert merge request !205960, which undoes the root cause of the CI runners file descriptor saturation, has been cherry-picked into the latest auto-deploy branch and is scheduled to begin deployment within the next hour.
We continue to monitor system health and job processing on GitLab.com. No new customer impact has been reported since the initial mitigation. We will confirm full resolution after the revert is deployed and all key metrics remain stable.
7h later
17:06:20 Update shared
Terri Chu (via @incident) shared an update
The revert merge request !205960 that addresses the CI runners file descriptor saturation was deployed to gprd-cny, but this environment was drained due to a separate incident affecting Terraform state endpoints. Deployment to production is pending until that issue is resolved and gprd-cny is enabled again.
We continue to monitor for any new customer impact. Once the revert is promoted to production and metrics confirm stability, we will confirm full resolution.
2025-09-24
04:09:28 Message from Thiago Figueiro
Thiago Figueiro pinned their own message
This incident was the root cause of the S3 incident #inc-4132-merge-request-home-page-is-empty
The rate-limiting that caused this incident also prevented a cache file from being uploaded for a GitLab image build (https://dev.gitlab.org/gitlab/gitlab-ee/-/jobs/30827983), so we built and deployed an image with broken assets.
Stan created a fix to prevent the same problem on the image creation side gitlab-org/gitlab!206131 (merged)
04:10:38 Message from Siddharth Kannan
Siddharth Kannan's message was pinned by Jan Provazník
A package with the revert MR is being deployed to gprd now:
https://gitlab.slack.com/archives/C8PKBH3M5/p1758685489162579
05:37:03 Message from Siddharth Kannan
Siddharth Kannan's message was pinned by Jan Provazník
The revert MR was released to gprd as part of this package (https://gitlab.slack.com/archives/C8PKBH3M5/p1758690483349359?thread_ts=1758685489.162579&cid=C8PKBH3M5) at 2025-09-24 05:08 UTC.
07:11:30 Incident resolved and entered the post-incident flow
Jan Provazník shared an update
Status: Monitoring → Documenting
The revert MR has been deployed to production and all metrics from https://gitlab.com/gitlab-com/gl-infra/production/-/issues/20592#note_2770031211 are OK. Getting this revert MR deployed took a bit longer because of unrelated incidents that delayed the deployment to production.
Investigation Notes
Timing-Dependent Vulnerability: The incident occurred during UTC night hours; although job load was lower, idle instance pools were also scaled down, leaving less buffer capacity to absorb the impact and fewer idle instances available for recovery.
Cascading Failure Pattern: The incident followed a specific cascade:
- Rate limiting causes trace update failures
- File descriptors accumulate from multiple sources: retry mechanisms, elevated memory/CPU/IOPS usage, and jobs stuck in various retry loops
- Docker Machine operations start failing due to FD exhaustion
- Autoscaling fails, creating no_free_executor conditions
- The system enters a death spiral as Docker Machine keeps trying to create/remove instances
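The open_fds saturation that drove this cascade can also be checked directly on a host, without the dashboards. The sketch below is a small, hypothetical Go helper (not an existing GitLab or Runner tool) that compares a process's open descriptors against its soft RLIMIT_NOFILE on Linux; the paths and the warning threshold are illustrative assumptions.

// fd_saturation_check.go -- hypothetical helper, not part of GitLab Runner. Reports how
// close the current process is to its file descriptor limit on Linux. To inspect another
// process (such as a runner manager), read /proc/<pid>/fd and /proc/<pid>/limits instead.
package main

import (
	"fmt"
	"os"
	"syscall"
)

func main() {
	// One directory entry per open descriptor of this process.
	fds, err := os.ReadDir("/proc/self/fd")
	if err != nil {
		fmt.Fprintln(os.Stderr, "reading /proc/self/fd:", err)
		return
	}

	// Soft limit on open files, which is what "too many open files" is measured against.
	var limit syscall.Rlimit
	if err := syscall.Getrlimit(syscall.RLIMIT_NOFILE, &limit); err != nil {
		fmt.Fprintln(os.Stderr, "getrlimit:", err)
		return
	}

	saturation := float64(len(fds)) / float64(limit.Cur)
	fmt.Printf("open fds: %d of %d (%.0f%% of soft limit)\n", len(fds), limit.Cur, saturation*100)
	if saturation > 0.9 { // illustrative threshold; during the incident the dashboards showed near 100%
		fmt.Println("WARNING: file descriptor saturation above 90%")
	}
}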
Follow-ups
Follow-up | Owner |
---|---|
2025-09-23: Open file descriptor utilization near capacity on ci-runners main stage | Unassigned |
Assess incident impact on customer compute minute usage and refunds | Nicole Williams |
Interesting to know that despite the very large decrease in successfully... | Zoe Braddock |
Review Guidelines
This review should be completed by the team which owns the service causing the alert. That team has the most context around what caused the problem and what information will be needed for an effective fix. The EOC or IMOC may create this issue, but unless they are also on the service owning team, they should assign someone from that team as the DRI.
For the person opening the Incident Review
- Set the title to Incident Review: (Incident issue name)
- Assign a Service::* label (most likely matching the one on the incident issue)
- Set a Severity::* label which matches the incident
- In the Key Information section, make sure to include a link to the incident issue
- Find and assign a DRI from the team which owns the service (check their Slack channel or assign the team's manager). The DRI for the incident review is the issue assignee.
For the assigned DRI
- Fill in the remaining fields in the Key Information section, using the incident issue as a reference. Feel free to ask the EOC or other folks involved if anything is difficult to find.
- If there are metrics showing Customers Affected or Requests Affected, link those metrics in those fields
- Create a few short sentences in the Summary section summarizing what happened (TL;DR)
- Link any corrective actions and describe any other actions or outcomes from the incident
- Consider the implications for self-managed and Dedicated instances. For example, do any bug fixes need to be backported?
- Once discussion wraps up in the comments, summarize any takeaways in the details section
- If the incident timeline does not contain any sensitive information and this review can be made public, turn off the issue's confidential mode and link this review to the incident issue
- Close the review before the due date
- Go back to the incident channel or page and close out the remaining post-incident tasks