Incident review: 2026-03-05: CI/CD Pipelines stuck in running status
INC-8087: CI/CD Pipelines stuck in running status
Generated by Steve Abrams on 6 Mar 2026 20:54. All timestamps are local to Etc/UTC
Key Information
| Metric | Value |
|---|---|
| Customers Affected | Many (~3,500 pipelines across broadly distributed projects; most had 1 stuck pipeline, highest had 15) |
| Requests Affected | ~3,500 pipelines stuck; initial backlog estimate 68k-130k in 'running' state |
| Incident Severity | Severity 2 |
| Impact Start Time | 2026-03-03 14:00:00 UTC |
| Impact End Time | 2026-03-06 12:46:09 UTC |
| Total Duration | 2 days, 22 hours |
| Time to Declaration | ~54.9 hours |
| Time to Fix | ~70.8 hours |
| Incident Lead | Steve Abrams (final); Aaron Richter, Furhan Shabir, Vasilii Iakliushin (rotated) |
| Reporter | Alejandro Guerrero |
| Link to Incident Issue | #21462 (closed), https://app.incident.io/gitlab/incidents/8087 |
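
The duration figures above follow directly from the three timestamps in this review. The sketch below reproduces the arithmetic; the declaration time is taken from the timeline (20:56 UTC), with seconds assumed to be zero since the timeline only records minutes.

```python
from datetime import datetime, timezone

# Timestamps copied from the Key Information table and the timeline (all UTC).
impact_start = datetime(2026, 3, 3, 14, 0, 0, tzinfo=timezone.utc)
declared = datetime(2026, 3, 5, 20, 56, 0, tzinfo=timezone.utc)  # seconds assumed 0
impact_end = datetime(2026, 3, 6, 12, 46, 9, tzinfo=timezone.utc)


def hours(delta):
    """Convert a timedelta to fractional hours."""
    return delta.total_seconds() / 3600


time_to_declaration = hours(declared - impact_start)
time_to_fix = hours(impact_end - impact_start)
total = impact_end - impact_start

print(f"Time to declaration: ~{time_to_declaration:.1f} hours")
print(f"Time to fix: ~{time_to_fix:.1f} hours")
print(f"Total duration: {total.days} days, {total.seconds // 3600} hours")
```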
Summary
Problem: CI/CD pipeline and job statuses did not transition to completed. Two distinct failure modes were observed: (1) jobs stuck in 'Created' status with pipelines stuck on 'Running', and (2) all jobs completing but pipelines still stuck on 'Running' status. The PipelineProcessWorker, which is responsible for transitioning pipeline state based on job statuses, was deferred by the Sidekiq concurrency limiter and never resumed.
Impact: About 3,500 pipelines across many customer projects became stuck in 'running' status and could not complete. The stuck pipelines were distributed broadly -- most affected projects had only 1 stuck pipeline, with the highest being 15. Customers could not complete CI/CD workflows, and deployments and feature flag changes were blocked until manually canceled and retried. No resources were consumed by the stuck pipelines.
Causes: Jobs deferred by the concurrency limit were never resumed, causing pipelines to stay in the running state. A race condition in the concurrency limit queue draining logic allowed the exclusive lease to expire mid-operation, enabling another process to interfere and lose jobs. The concurrency_limit_eager_resume_processing feature flag (enabled ~Feb 22) exposed this race condition. DB latency spikes drove PipelineProcessWorker concurrency to 53.3k, triggering mass deferrals. Disabling the concurrency limiter for PipelineProcessWorker prevented new pipelines from getting stuck. The underlying bug matches previously reported issues (gitlab-org/gitlab#580466, gitlab-org/gitlab#582085).
Response strategy: We disabled the concurrency limiter for PipelineProcessWorker, which stopped new pipelines from getting stuck. The concurrency_limit_eager_resume_processing feature flag was then disabled to address the race condition, and the concurrency limiter was re-enabled. Customers can cancel and retry affected pipelines as a workaround. We decided not to process stuck pipelines automatically to avoid unsafe or stale state changes (risk of stale deployments, package overwrites). Stan Hu authored MR !226342 to fix the exclusive lease renewal race condition. A long-term self-healing mechanism is being tracked in gitlab-org/gitlab#582085.
This is a repeat incident -- the same underlying bug was behind incident 5372 (November 2025) and prior customer issues (gitlab-com/request-for-help#3732).
What went well?
- Fast root cause correlation: The team quickly connected the stuck pipelines to the previously reported concurrency limiter bug (gitlab-org/gitlab#580466 / #582085), which accelerated root cause identification.
- Strong collaborative investigation: Aaron Richter drove thorough Kibana and Grafana analysis of Sidekiq worker concurrency, DB latency, and queue sizes. Max Fan, Stan Hu, Cameron McFarland, and others contributed key observations that built a clear picture of the failure.
- Clean, effective mitigation: Disabling the concurrency limiter via feature flag was a targeted fix that stopped new pipelines from getting stuck without broader side effects. The mitigation was later refined -- the eager resume flag was disabled (addressing the contributing factor), and the concurrency limiter was re-enabled.
- Prompt escalation: Alejandro Guerrero escalated promptly once the pattern was identified, and the EOC/IMOC escalation paths brought in the right people quickly.
- Safety-first decision on backlog: The team correctly assessed that auto-processing stuck pipelines could trigger unsafe stale mutations (deployments, package overwrites) and chose not to do so, despite pressure to resolve the backlog.
- Rapid fix MR: Stan Hu authored MR !226342 (fix race condition in concurrency limit queue draining) on the same day the incident was resolved.
What was difficult?
- ~55 hour detection gap: Impact started 2026-03-03 14:00 UTC but the incident was not declared until 2026-03-05 20:56 UTC. There was no automated alerting for pipelines stuck in 'running' state with no active jobs. The issue was only caught when enough customers reported it.
- Repeat incident without prior fix: This is the same root cause as incident 5372 (November 2025). The long-term fix (self-healing stuck pipelines, gitlab-org/gitlab#582085) and the underlying concurrency limiter bug (gitlab-org/gitlab#580466) had not been prioritized, allowing recurrence.
- Hesitation around feature flag changes: During initial investigation, the team was concerned that disabling the concurrency limiter feature flag could introduce other problems, which delayed mitigation by several hours.
- Two distinct failure modes: Max Fan identified that some pipelines were stuck because jobs were never picked up, while others were stuck after all jobs finished. This complicated diagnosis and scoping.
- Incomplete pipeline scoping methods: Hordur Freyr Yngvason discovered that prior methods for determining affected pipelines were incomplete because external statuses (generic commit statuses) also impact pipeline processing and were not captured by checking running builds alone.
- Large stuck pipeline backlog with no safe remediation: After mitigation, 3,500-8,000+ pipelines remained stuck. Automated remediation was deemed unsafe (risk of stale deployments/package overwrites), leaving manual cancel-and-retry as the only customer option.
Contributing Factors
1. Race condition in concurrency limit queue draining (Primary)
With the concurrency_limit_eager_resume_processing feature flag enabled (~Feb 22), the resume_processing! method loops and drains the throttled queue in batches. However, the exclusive lease was not renewed between loop iterations, so it could expire mid-operation. Once the lease expired, another process could obtain it and interfere with queue operations, causing PipelineProcessWorker jobs to be lost or duplicated.
Fix: MR !226342 -- Renew the lease before each iteration, ensuring exclusive access throughout the entire draining operation.
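
The race and the lease-renewal fix can be illustrated with a small simulation. This is a minimal Python sketch, not GitLab's code: FakeClock, ExclusiveLease, drain, and every number (TTL, batch size, batch duration) are hypothetical stand-ins chosen for illustration. It counts how often the draining process is still working after its lease has lapsed, which is the window in which a second process could obtain the lease.

```python
class FakeClock:
    """Deterministic clock so the sketch runs without real sleeps."""

    def __init__(self):
        self.now = 0.0

    def advance(self, seconds):
        self.now += seconds


class ExclusiveLease:
    """In-memory stand-in for a TTL-based exclusive lease (hypothetical)."""

    def __init__(self, clock, ttl):
        self.clock = clock
        self.ttl = ttl
        self.holder = None
        self.expires_at = 0.0

    def held_by(self, owner):
        return self.holder == owner and self.clock.now < self.expires_at

    def try_obtain(self, owner):
        # The lease is free if nobody holds it or the TTL has elapsed.
        if self.holder is None or self.clock.now >= self.expires_at:
            self.holder = owner
            self.expires_at = self.clock.now + self.ttl
            return True
        return False

    def renew(self, owner):
        # Only the current, unexpired holder may extend the TTL.
        if self.held_by(owner):
            self.expires_at = self.clock.now + self.ttl
            return True
        return False


def drain(jobs, ttl, batch_size, batch_seconds, renew_each_iteration):
    """Drain jobs in batches and return the number of batch boundaries at
    which work remained but the drainer no longer held the lease."""
    clock = FakeClock()
    lease = ExclusiveLease(clock, ttl)
    queue = list(jobs)
    unprotected = 0
    if not lease.try_obtain("drainer"):
        return unprotected
    while queue:
        if renew_each_iteration:
            # Sketch of the fix: extend the lease before each iteration.
            lease.renew("drainer")
        del queue[:batch_size]          # resume one batch of deferred jobs
        clock.advance(batch_seconds)    # simulated time spent on the batch
        if queue and not lease.held_by("drainer"):
            unprotected += 1            # a rival could obtain the lease here
    return unprotected


without_renewal = drain(range(50), ttl=10, batch_size=10,
                        batch_seconds=4, renew_each_iteration=False)
with_renewal = drain(range(50), ttl=10, batch_size=10,
                     batch_seconds=4, renew_each_iteration=True)
```

In this sketch the non-renewing drain keeps working after its lease has lapsed, while renewing before each iteration keeps the lease held for the whole drain. Per-iteration renewal only helps as long as a single batch finishes within the TTL.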
2. Feature flag: concurrency_limit_eager_resume_processing
Enabled around February 22nd. This flag changed the resume processing behavior to eagerly drain the throttled queue, which exposed the lease expiration race condition under high concurrency. The underlying bug existed before, but this flag significantly increased the probability of triggering it.
3. Database latency spike under heavy load
A DB latency spike around 2026-03-05 00:02 UTC caused PipelineProcessWorker to slow down, driving concurrency to 53.3k and triggering mass deferrals by the concurrency limiter. Rails request latency also spiked at 15:22 UTC, coinciding with the concurrency limit queue spike.
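
One way to see why a latency spike inflates concurrency is Little's law: at steady state, the average number of jobs in flight equals the arrival rate multiplied by the average job duration. The sketch below is a simplified illustration only. The function names, arrival rate, durations, and limit are hypothetical and are not measurements from this incident.

```python
def in_flight(arrival_rate_per_s, mean_duration_s):
    """Little's law: average jobs in flight = arrival rate x mean duration."""
    return arrival_rate_per_s * mean_duration_s


def excess_over_limit(arrival_rate_per_s, mean_duration_s, limit):
    """Jobs above the limit, which a concurrency limiter would defer."""
    return max(0, in_flight(arrival_rate_per_s, mean_duration_s) - limit)


# Hypothetical numbers: same arrival rate, job duration grows 16x during a spike.
normal = excess_over_limit(arrival_rate_per_s=400, mean_duration_s=0.25, limit=200)
spike = excess_over_limit(arrival_rate_per_s=400, mean_duration_s=4.0, limit=200)
```

With an unchanged arrival rate, slower jobs alone are enough to push in-flight work past the limit, which is consistent with the mass deferrals described above.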
4. No self-healing mechanism for stuck pipelines
There is no "stuck pipeline worker" analogous to the existing stuck build worker. Once PipelineProcessWorker jobs are lost, pipelines remain permanently stuck with no automatic recovery. This has been a known gap since at least 2019 (gitlab-org/gitlab#36237 (closed)).
5. No alerting for stuck pipelines
There is no automated alerting for pipelines stuck in 'running' state with no active jobs, which allowed the ~55 hour detection gap.
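
A detection rule for factors 4 and 5 could look roughly like the sketch below. It is a hedged illustration, not an existing GitLab worker or alert: the Pipeline record, the ACTIVE status set, and the one-hour idle threshold are all assumptions. It treats a pipeline as stuck when it is 'running', has no job or external status still making progress, and has not been updated for longer than the threshold. External statuses are included because checking builds alone was found to be incomplete.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone

# Assumed set of statuses that indicate work is still progressing.
ACTIVE = {"pending", "running"}


@dataclass
class Pipeline:
    id: int
    status: str
    updated_at: datetime
    job_statuses: list = field(default_factory=list)
    external_statuses: list = field(default_factory=list)


def is_stuck(pipeline, now, idle_threshold=timedelta(hours=1)):
    """Return True if a running pipeline has no active work and is idle."""
    if pipeline.status != "running":
        return False
    statuses = pipeline.job_statuses + pipeline.external_statuses
    if any(status in ACTIVE for status in statuses):
        return False
    return now - pipeline.updated_at >= idle_threshold


now = datetime(2026, 3, 6, 12, 0, tzinfo=timezone.utc)
old = now - timedelta(hours=10)
pipelines = [
    Pipeline(1, "running", old, ["success", "success"]),           # all jobs done
    Pipeline(2, "running", old, ["success"], ["running"]),         # external status active
    Pipeline(3, "running", old, ["success", "created"]),           # job never picked up
    Pipeline(4, "running", now - timedelta(minutes=5), ["success"]),  # recently updated
    Pipeline(5, "success", old, ["success"]),                      # already finished
]
stuck_ids = [p.id for p in pipelines if is_stuck(p, now)]
```

Pipelines 1 and 3 correspond to the two failure modes seen in this incident: all jobs finished, and jobs left in 'created'. An alert could fire when the count of such pipelines exceeds a baseline.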
Related Issues and Prior Art
| Issue | Description | Status |
|---|---|---|
| gitlab-org/gitlab#580466 | Bug: Jobs marked with concurrency_limit status not being requeued by ResumeWorker | Open (assigned: Stan Hu, milestone: 18.10) |
| gitlab-org/gitlab#582085 | Self-heal pipelines stuck with executing status with no executing builds | Open (priority::2, severity::2) |
| gitlab-org/gitlab!208142 (merged) | Reorder DuplicateJobs/ConcurrencyLimit middleware (merged Oct 2025) | Merged (gated by env var) |
| gitlab-org/gitlab!226342 (merged) | Fix race condition in concurrency limit queue draining | Open (pipeline passing, in review) |
| gitlab-com/gl-infra/production#20833 | Prior change request: Re-execute PipelineProcessWorker (Nov 2025) | Closed |
| gitlab-org/gitlab#36237 (closed) | Historic issue: pipeline shows running when all jobs finished (2019) | Closed |
Investigation Details
Timeline (Curated)
Tuesday 2026-03-03
| Time (UTC) | Event |
|---|---|
| 14:00 | Impact started -- Pipelines begin getting stuck in 'running' state |
Thursday 2026-03-05 (2 days later)
| Time (UTC) | Event |
|---|---|
| 00:02 | Large error spike in PipelineProcessWorker (correlated retroactively via Kibana) |
| 20:56 | Incident declared (Alejandro Guerrero). Severity 2. Deployments and feature flags blocked. Escalated to GitLab.com Production and IMOC |
| 21:01 | Aaron Richter takes Incident Lead |
| 21:02 | Confirmed: multiple customers reporting stuck pipelines; at least one stuck for 2 days |
| 21:12 | Alejandro Guerrero: spikes of PipelineProcessWorker hitting concurrency_limit in the past 3 days |
| 21:42 | Max Fan identifies two distinct failure modes: jobs never picked up vs. all jobs finished but pipeline stuck |
| 21:46 | Aaron Richter: Kibana shows large PipelineProcessWorker error spike at 00:02, no other errors, unrelated to deploys or feature flag changes |
| 21:59 | Aaron Richter: 53.3k concurrent PipelineProcessWorker jobs observed |
| 22:10 | Cameron McFarland: significant DB waiting observed |
| 22:16 | Aaron Richter: Rails request latency spike at 15:22 UTC coinciding with concurrency limit queue spike |
| 22:24 | Aaron Richter: deduplication middleware runs before concurrency limiting -- last job can be deduplicated away |
| 22:30 | First update shared: initial hypothesis of deduplication/concurrency limit middleware bug |
| 22:41 | Stan Hu: manually running PipelineProcessWorker is required to resume currently stuck pipelines |
| 22:54 | Team concerned that disabling the concurrency limiter feature flag could introduce other problems |
| 22:57 | Alejandro Guerrero links to previously reported issue gitlab-org/gitlab#582085 |
| 23:00 | Stan Hu identifies concurrency_limit_eager_resume_processing feature flag (enabled ~Feb 22) as possible contributor |
| 23:02 | Root cause correlated to known concurrency limiter bug |
| 23:07 | Incident Lead passed to Furhan Shabir |
| 23:13 | Max Fan commits to prioritizing self-healing mechanism at SEV2 |
| 23:28 | Stan Hu: exclusive lease mechanism may have failed, allowing concurrent cron job executions |
| 23:37 | Thiago Figueiro update: exclusive lease failure identified; active impact subsiding |
| 23:39 | Max Fan: DB spikes preceded the queue buildup (multiple dashboard links shared) |
Friday 2026-03-06
| Time (UTC) | Event |
|---|---|
| 01:16 | Incident paused (Furhan Shabir): not actively occurring, but could recur under high load. Marked as "Identified" |
| 01:36 | Deployments unblocked |
| 03:41 | Feature flags unblocked |
| 05:00 | Incident auto-resumed |
| 06:33 | Gregorius Marco confirms PipelineProcessWorker still being deferred (job_status = concurrency_limit) |
| 06:50 | Action created: Disable concurrency limiter for PipelineProcessWorker |
| 06:53 | Action completed (Furhan Shabir): Feature flag disable_sidekiq_concurrency_limit_middleware_PipelineProcessWorker enabled globally via ChatOps |
| 07:04 | Update shared: monitoring to confirm mitigation |
| 12:46 | Fixed: Confirmed mitigation effective. Incident paused 24h for monitoring (Pravar Gauba) |
| 17:02 | Steve Abrams re-escalates to IMOC |
| 17:06 | Vasilii Iakliushin takes Incident Lead |
| 17:38 | Vasilii Iakliushin: 68k-130k pipelines still in 'running' state backlog; no new stuck pipelines |
| 17:55 | Hordur Yngvason: most "running" pipelines from last 48h are actually stuck (zombie problem) |
| 18:28 | Hordur Yngvason: prior scoping methods incomplete -- external statuses also affect pipeline processing |
| 19:06 | Team determines restarting stuck pipelines after long delay is unsafe (risk of stale deployments) |
| 19:15 | Confirmed: newly stuck pipelines dropped to near zero in the last 12h -- mitigation is working |
| 19:19 | Refined scope: ~3,500-8,000 stuck pipelines in main window |
| 19:27 | Stuck pipelines distributed broadly across many projects (not concentrated) |
| 20:27 | Stan Hu opens MR !226342: Fix race condition in concurrency limit queue draining |
| 20:38 | Incident resolved (Steve Abrams): Will not auto-process backlog; customers advised to cancel and retry |
| 20:56 | Stan Hu: concurrency_limit_eager_resume_processing played a role in losing jobs; recommends disabling until MR !226342 lands |
| 21:50 | Hordur Yngvason: concurrency_limit_eager_resume_processing disabled; concurrency limiter re-enabled for PipelineProcessWorker (safe now) |
Zoom Call Summaries
2026-03-06 evening call: The team investigated the scope and impact of stuck CI pipelines, determining that while thousands of pipelines remain stuck, customer impact is low and the issue is resolved for new pipelines. Hordur Yngvason determined most "running" pipelines from the last 48h were actually stuck, and that external statuses also affect pipeline processing. The team agreed restarting stuck pipelines is unsafe, and no immediate bulk action is needed. Follow-up work planned for backlog cleanup and system hardening.
2026-03-05 evening call: The team investigated customer pipelines stuck in running state despite all jobs completing. Aaron Richter identified the deduplication/concurrency limit middleware interaction. Stan Hu identified the concurrency_limit_eager_resume_processing feature flag and exclusive lease failure as contributing factors. Root cause correlated to known bug gitlab-org/gitlab#582085.
Actions
| Action | Owner | Status |
|---|---|---|
| Disable concurrency limiter for PipelineProcessWorker via disable_sidekiq_concurrency_limit_middleware_PipelineProcessWorker | Furhan Shabir | Completed (06:53 UTC) |
| Disable concurrency_limit_eager_resume_processing feature flag | Post-resolution | Completed |
| Re-enable concurrency limiter for PipelineProcessWorker (safe after eager resume disabled) | Hordur Yngvason | Completed (~21:50 UTC) |
Follow-ups / Corrective Actions
| Action | Owner | Issue | Status |
|---|---|---|---|
| Fix race condition in concurrency limit queue draining (lease renewal) | Stan Hu | gitlab-org/gitlab!226342 (merged) | In review (pipeline passing) |
| Self-heal pipelines stuck with executing status with no executing builds | Max Fan / Hordur Yngvason | gitlab-org/gitlab#582085 | Outstanding |
| Fix ConcurrencyLimit::ResumeWorker bug (jobs not re-queued) | Stan Hu | gitlab-org/gitlab#580466 | Open (milestone 18.10) |
| Add alerting for pipelines stuck in 'running' state with no active jobs (detection took ~55h) | Hordur Yngvason / Panos Kanellidis | gitlab-org/gitlab#592819 | Created |
| Evaluate safety of auto-processing stuck pipelines (threshold after which processing is unsafe) | Hordur Yngvason / Max Fan | gitlab-org/gitlab#582085 (comment) | Discussion |
| Open FCL issue per S2 process | Hordur Yngvason / Cheryl Li | Feature Change Lock | Created |
Self-Managed / Dedicated Implications
This bug affects any GitLab instance using the Sidekiq concurrency limiter with PipelineProcessWorker where the concurrency_limit_eager_resume_processing feature flag is enabled. Self-managed and Dedicated instances could be affected under similar high-load conditions.
Workaround: Disable concurrency_limit_eager_resume_processing feature flag.
Permanent fix: MR !226342 (fix exclusive lease renewal in queue draining) needs to be merged and potentially backported.
Review Guidelines
This review should be completed by the team which owns the service causing the alert. That team has the most context around what caused the problem and what information will be needed for an effective fix. The EOC or IMOC may create this issue, but unless they are also on the service owning team, they should assign someone from that team as the DRI.
For the person opening the Incident Review
- Set the title to Incident Review: (Incident issue name)
- Assign a Service::* label (most likely matching the one on the incident issue)
- Set a Severity::* label which matches the incident
- In the Key Information section, make sure to include a link to the incident issue
- Find and assign a DRI from the team which owns the service (check their Slack channel or assign the team's manager). The DRI for the incident review is the issue assignee.
For the assigned DRI
- Fill in the remaining fields in the Key Information section, using the incident issue as a reference. Feel free to ask the EOC or other folks involved if anything is difficult to find.
- If there are metrics showing Customers Affected or Requests Affected, link those metrics in those fields
- For all S1 and S2 incidents, begin the Feature Change Lock (FCL) process and open an issue in the FCL project.
- Create a few short sentences in the Summary section summarizing what happened (TL;DR)
- Link any corrective actions and describe any other actions or outcomes from the incident
- Consider the implications for self-managed and Dedicated instances. For example, do any bug fixes need to be backported?
- Once discussion wraps up in the comments, summarize any takeaways in the details section
- If the incident timeline does not contain any sensitive information and this review can be made public, turn off the issue's confidential mode and link this review to the incident issue.
- S1 incidents require a public RCA within 7 days of the incident. If this review cannot be made public, create a separate public RCA.
- Close the review before the due date
- Go back to the incident channel or page and close out the remaining post-incident tasks