Incident review: 2026-03-05: CI/CD Pipelines stuck in running status

INC-8087: CI/CD Pipelines stuck in running status

Generated by Steve Abrams on 6 Mar 2026 20:54. All timestamps are local to Etc/UTC

Key Information

Metric | Value
Customers Affected | Many (~3,500 pipelines across broadly distributed projects; most had 1 stuck pipeline, highest had 15)
Requests Affected | ~3,500 pipelines stuck; initial backlog estimate 68k-130k in 'running' state
Incident Severity | Severity 2
Impact Start Time | 2026-03-03 14:00:00 UTC
Impact End Time | 2026-03-06 12:46:09 UTC
Total Duration | 2 days, 22 hours
Time to Declaration | ~54.9 hours
Time to Fix | ~70.8 hours
Incident Lead | Steve Abrams (final); Aaron Richter, Furhan Shabir, Vasilii Iakliushin (rotated)
Reporter | Alejandro Guerrero
Link to Incident | Issue #21462 (closed), https://app.incident.io/gitlab/incidents/8087
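
The duration figures above can be reproduced from the recorded timestamps. The following is a minimal sketch, assuming the declaration happened at exactly 20:56:00 UTC (the timeline only gives minute precision); the helper name hours_between is illustrative and not taken from any tooling.

```python
from datetime import datetime, timezone

# Timestamps from the Key Information table and the timeline.
impact_start = datetime(2026, 3, 3, 14, 0, 0, tzinfo=timezone.utc)
declared = datetime(2026, 3, 5, 20, 56, 0, tzinfo=timezone.utc)  # seconds assumed
impact_end = datetime(2026, 3, 6, 12, 46, 9, tzinfo=timezone.utc)


def hours_between(start, end):
    """Elapsed time between two datetimes, in fractional hours."""
    return (end - start).total_seconds() / 3600


time_to_declaration = round(hours_between(impact_start, declared), 1)
time_to_fix = round(hours_between(impact_start, impact_end), 1)

print(time_to_declaration)  # 54.9
print(time_to_fix)          # 70.8
```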

Summary

Problem: CI/CD pipeline and job statuses did not transition to completed. Two distinct failure modes were observed: (1) jobs stuck in 'Created' status with pipelines stuck on 'Running', and (2) all jobs completing but pipelines still stuck on 'Running' status. The PipelineProcessWorker, which is responsible for transitioning pipeline state based on job statuses, was deferred by the Sidekiq concurrency limiter and never resumed.

Impact: About 3,500 pipelines across many customer projects became stuck in 'running' status and could not complete. The stuck pipelines were distributed broadly -- most affected projects had only 1 stuck pipeline, with the highest being 15. Customers could not complete affected CI/CD workflows until they manually canceled and retried the stuck pipelines. Deployments and feature flag changes were also blocked during the incident. No resources were consumed by the stuck pipelines.

Causes: Jobs deferred by the concurrency limit were never resumed, causing pipelines to stay in the running state. A race condition in the concurrency limit queue draining logic allowed the exclusive lease to expire mid-operation, enabling another process to interfere and lose jobs. The concurrency_limit_eager_resume_processing feature flag (enabled ~Feb 22) exposed this race condition. DB latency spikes drove PipelineProcessWorker concurrency to 53.3k, triggering mass deferrals. Disabling the concurrency limiter for PipelineProcessWorker prevented new pipelines from getting stuck. The underlying bug matches previously reported issues (gitlab-org/gitlab#580466, gitlab-org/gitlab#582085).

Response strategy: We disabled the concurrency limiter for PipelineProcessWorker, which stopped new pipelines from getting stuck. The concurrency_limit_eager_resume_processing feature flag was then disabled to address the race condition, and the concurrency limiter was re-enabled. Customers can cancel and retry affected pipelines as a workaround. We decided not to process stuck pipelines automatically to avoid unsafe or stale state changes (risk of stale deployments, package overwrites). Stan Hu authored MR !226342 to fix the exclusive lease renewal race condition. A long-term self-healing mechanism is being tracked in gitlab-org/gitlab#582085.

This is a repeat incident -- the same underlying bug was behind incident 5372 (November 2025) and prior customer issues (gitlab-com/request-for-help#3732).

What went well?

  1. Fast root cause correlation: The team quickly connected the stuck pipelines to the previously reported concurrency limiter bug (gitlab-org/gitlab#580466 / #582085), which accelerated root cause identification.
  2. Strong collaborative investigation: Aaron Richter drove thorough Kibana and Grafana analysis of Sidekiq worker concurrency, DB latency, and queue sizes. Max Fan, Stan Hu, Cameron McFarland, and others contributed key observations that built a clear picture of the failure.
  3. Clean, effective mitigation: Disabling the concurrency limiter via feature flag was a targeted fix that stopped new pipelines from getting stuck without broader side effects. The mitigation was later refined -- the eager resume flag was disabled (addressing the contributing factor), and the concurrency limiter was re-enabled.
  4. Prompt escalation: Alejandro Guerrero escalated promptly once the pattern was identified, and the EOC/IMOC escalation paths brought in the right people quickly.
  5. Safety-first decision on backlog: The team correctly assessed that auto-processing stuck pipelines could trigger unsafe stale mutations (deployments, package overwrites) and chose not to do so, despite pressure to resolve the backlog.
  6. Rapid fix MR: Stan Hu authored MR !226342 (fix race condition in concurrency limit queue draining) on the same day the incident was resolved.

What was difficult?

  1. ~55 hour detection gap: Impact started 2026-03-03 14:00 UTC but the incident was not declared until 2026-03-05 20:56 UTC. There was no automated alerting for pipelines stuck in 'running' state with no active jobs. The issue was only caught when enough customers reported it.
  2. Repeat incident without prior fix: This is the same root cause as incident 5372 (November 2025). The long-term fix (self-healing stuck pipelines, gitlab-org/gitlab#582085) and the underlying concurrency limiter bug (gitlab-org/gitlab#580466) had not been prioritized, allowing recurrence.
  3. Hesitation around feature flag changes: During initial investigation, the team was concerned that disabling the concurrency limiter feature flag could introduce other problems, which delayed mitigation by several hours.
  4. Two distinct failure modes: Max Fan identified that some pipelines were stuck because jobs were never picked up, while others were stuck after all jobs finished. This complicated diagnosis and scoping.
  5. Incomplete pipeline scoping methods: Hordur Freyr Yngvason discovered that prior methods for determining affected pipelines were incomplete because external statuses (generic commit statuses) also impact pipeline processing and were not captured by checking running builds alone.
  6. Large stuck pipeline backlog with no safe remediation: After mitigation, 3,500-8,000+ pipelines remained stuck. Automated remediation was deemed unsafe (risk of stale deployments/package overwrites), leaving manual cancel-and-retry as the only customer option.

Contributing Factors

1. Race condition in concurrency limit queue draining (Primary)

When the concurrency_limit_eager_resume_processing feature flag was enabled (~Feb 22), the resume_processing! method was changed to loop and drain the throttled queue in batches. However, the exclusive lease was not being renewed between loop iterations, allowing it to expire mid-operation. When the lease expired, another process could obtain it and interfere with queue operations, causing PipelineProcessWorker jobs to be lost or duplicated.

Fix: MR !226342 -- Renew the lease before each iteration, ensuring exclusive access throughout the entire draining operation.
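
The race and the shape of the fix can be illustrated with a toy model. This is a sketch under stated assumptions, not the GitLab implementation: Lease, FakeClock, drain, and rival_intrusions are hypothetical names, the lease is in-memory, and time is simulated. It shows only that a batch loop that outlives its lease TTL lets a second process take the lease, and that renewing before each iteration prevents this as long as one batch takes less time than the TTL.

```python
from collections import deque


class FakeClock:
    """Manually advanced clock so the example is deterministic."""

    def __init__(self):
        self.now = 0.0

    def advance(self, seconds):
        self.now += seconds


class Lease:
    """Toy exclusive lease with a TTL (hypothetical, in-memory only)."""

    def __init__(self, clock, ttl):
        self.clock = clock
        self.ttl = ttl
        self.owner = None
        self.expires_at = 0.0

    def try_obtain(self, owner):
        # The lease is free if nobody holds it or the holder's TTL ran out.
        if self.owner is None or self.clock.now >= self.expires_at:
            self.owner = owner
            self.expires_at = self.clock.now + self.ttl
            return True
        return False

    def renew(self, owner):
        # Renewal only succeeds while the caller still holds a live lease.
        if self.owner == owner and self.clock.now < self.expires_at:
            self.expires_at = self.clock.now + self.ttl
            return True
        return False


def drain(queue, lease, clock, owner, batch_size, batch_seconds, renew,
          on_batch=None):
    """Drain `queue` in batches and return the jobs resumed.

    With renew=True the lease is renewed before every iteration, and the
    loop stops if renewal fails, so no batch runs without exclusivity.
    """
    resumed = []
    if not lease.try_obtain(owner):
        return resumed
    while queue:
        if renew and not lease.renew(owner):
            break  # lost the lease: stop instead of racing another process
        batch = [queue.popleft() for _ in range(min(batch_size, len(queue)))]
        clock.advance(batch_seconds)  # simulated time spent on the batch
        resumed.extend(batch)
        if on_batch:
            on_batch()
    return resumed


def rival_intrusions(renew):
    """Times at which a second process obtained the lease mid-drain."""
    clock = FakeClock()
    lease = Lease(clock, ttl=5)
    queue = deque(range(10))
    intrusions = []

    def rival_attempt():
        # A second process polls for the lease while work remains.
        if queue and lease.try_obtain("rival"):
            intrusions.append(clock.now)

    drain(queue, lease, clock, "primary", batch_size=2, batch_seconds=3,
          renew=renew, on_batch=rival_attempt)
    return intrusions


print(rival_intrusions(renew=False))  # [6.0, 12.0]
print(rival_intrusions(renew=True))   # []
```

Without renewal, the rival obtains the lease while the primary is still popping jobs, which is the window in which two processes can operate on the same queue. Breaking out of the loop when renewal fails is a design choice of this sketch, not necessarily what the MR does.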

2. Feature flag: concurrency_limit_eager_resume_processing

Enabled around February 22nd. This flag changed the resume processing behavior to eagerly drain the throttled queue, which exposed the lease expiration race condition under high concurrency. The underlying bug existed before, but this flag significantly increased the probability of triggering it.

3. Database latency spike under heavy load

A DB latency spike around 2026-03-05 00:02 UTC caused PipelineProcessWorker to slow down, driving concurrency to 53.3k and triggering mass deferrals by the concurrency limiter. Rails request latency also spiked at 15:22 UTC, coinciding with the concurrency limit queue spike.
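
One way to see why a latency spike translates into a concurrency spike is Little's law, which relates average steady-state quantities in a queueing system. It is offered here as a rough model only: the review records the observed concurrency (53.3k) but not the arrival rate or per-job latency, so no incident figures are substituted into it.

```latex
% Little's law: steady-state averages for a stable queueing system.
%   L       = average number of jobs in the system (concurrency)
%   \lambda = average job arrival rate
%   W       = average time a job spends in the system
L = \lambda W
% At a fixed arrival rate, scaling W by a factor k scales L by k:
L' = \lambda (k W) = k L
```

Under this model, if pipeline activity keeps enqueueing PipelineProcessWorker jobs at roughly the same rate while slow database queries lengthen each job, the number of jobs in flight grows in proportion and crosses the limiter's threshold.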

4. No self-healing mechanism for stuck pipelines

There is no "stuck pipeline worker" analogous to the existing stuck build worker. Once PipelineProcessWorker jobs are lost, pipelines remain permanently stuck with no automatic recovery. This has been a known gap since at least 2019 (gitlab-org/gitlab#36237 (closed)).

5. No alerting for stuck pipelines

There is no automated alerting for pipelines stuck in 'running' state with no active jobs, which allowed the ~55 hour detection gap.
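
The detection condition behind both gaps above (self-healing and alerting) can be sketched as a predicate. This is a minimal illustration, assuming a simplified data model: Pipeline, Status, ACTIVE_STATUSES, and the one-hour grace period are all hypothetical and not GitLab's schema. External statuses (generic commit statuses) are checked alongside builds, since the investigation found that checking running builds alone is incomplete.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone

# Assumption: statuses that mean "work is still in flight".
ACTIVE_STATUSES = {"pending", "running"}


@dataclass
class Status:
    name: str
    status: str
    external: bool = False  # True for generic commit statuses


@dataclass
class Pipeline:
    id: int
    status: str
    updated_at: datetime
    statuses: list = field(default_factory=list)  # builds and external statuses


def is_stuck(pipeline, now, grace=timedelta(hours=1)):
    """Return True if a 'running' pipeline has no in-flight work.

    Builds and external statuses are both inspected, since either can
    keep a pipeline legitimately running.
    """
    if pipeline.status != "running":
        return False
    if any(s.status in ACTIVE_STATUSES for s in pipeline.statuses):
        return False
    # The grace period avoids flagging pipelines that are mid-transition.
    return now - pipeline.updated_at > grace


now = datetime(2026, 3, 6, 18, 0, tzinfo=timezone.utc)
old = now - timedelta(hours=5)

all_done = Pipeline(1, "running", old, [Status("build", "success")])
ext_running = Pipeline(2, "running", old, [
    Status("build", "success"),
    Status("external-check", "running", external=True),
])
never_started = Pipeline(3, "running", old, [Status("test", "created")])

print(is_stuck(all_done, now))       # True
print(is_stuck(ext_running, now))    # False
print(is_stuck(never_started, now))  # True
```

The first and third examples correspond to the two failure modes described in the Summary: all jobs finished but the pipeline is still 'running', and jobs left in 'created' that were never picked up.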

Issue | Description | Status
gitlab-org/gitlab#580466 | Bug: Jobs marked with concurrency_limit status not being requeued by ResumeWorker | Open (assigned: Stan Hu, milestone: 18.10)
gitlab-org/gitlab#582085 | Self-heal pipelines stuck with executing status with no executing builds | Open (priority::2, severity::2)
gitlab-org/gitlab!208142 (merged) | Reorder DuplicateJobs/ConcurrencyLimit middleware (merged Oct 2025) | Merged (gated by env var)
gitlab-org/gitlab!226342 (merged) | Fix race condition in concurrency limit queue draining | Open (pipeline passing, in review)
gitlab-com/gl-infra/production#20833 | Prior change request: Re-execute PipelineProcessWorker (Nov 2025) | Closed
gitlab-org/gitlab#36237 (closed) | Historic issue: pipeline shows running when all jobs finished (2019) | Closed

Investigation Details

Timeline (Curated)

Tuesday 2026-03-03

Time (UTC) Event
14:00 Impact started -- Pipelines begin getting stuck in 'running' state

Thursday 2026-03-05 (2 days later)

Time (UTC) Event
00:02 Large error spike in PipelineProcessWorker (correlated retroactively via Kibana)
20:56 Incident declared (Alejandro Guerrero). Severity 2. Deployments and feature flags blocked. Escalated to GitLab.com Production and IMOC
21:01 Aaron Richter takes Incident Lead
21:02 Confirmed: multiple customers reporting stuck pipelines; at least one report dates back 2 days
21:12 Alejandro Guerrero: spikes of PipelineProcessWorker hitting concurrency_limit in the past 3 days
21:42 Max Fan identifies two distinct failure modes: jobs never picked up vs. all jobs finished but pipeline stuck
21:46 Aaron Richter: Kibana shows large PipelineProcessWorker error spike at 00:02, no other errors, not correlated with deploys or feature flag changes
21:59 Aaron Richter: 53.3k concurrent PipelineProcessWorker jobs observed
22:10 Cameron McFarland: significant DB waiting observed
22:16 Aaron Richter: Rails request latency spike at 15:22 UTC coinciding with concurrency limit queue spike
22:24 Aaron Richter: deduplication middleware runs before concurrency limiting -- last job can be deduplicated away
22:30 First update shared: initial hypothesis of deduplication/concurrency limit middleware bug
22:41 Stan Hu: manually running PipelineProcessWorker is required to resume currently stuck pipelines
22:54 Team concerned that disabling the concurrency limiter feature flag could introduce other problems
22:57 Alejandro Guerrero links to previously reported issue gitlab-org/gitlab#582085
23:00 Stan Hu identifies concurrency_limit_eager_resume_processing feature flag (enabled ~Feb 22) as possible contributor
23:02 Root cause correlated to known concurrency limiter bug
23:07 Incident Lead passed to Furhan Shabir
23:13 Max Fan commits to prioritizing self-healing mechanism at SEV2
23:28 Stan Hu: exclusive lease mechanism may have failed, allowing concurrent cron job executions
23:37 Thiago Figueiro update: exclusive lease failure identified; active impact subsiding
23:39 Max Fan: DB spikes preceded the queue buildup (multiple dashboard links shared)

Friday 2026-03-06

Time (UTC) Event
01:16 Incident paused (Furhan Shabir): not actively occurring, but could recur under high load. Marked as "Identified"
01:36 Deployments unblocked
03:41 Feature flags unblocked
05:00 Incident auto-resumed
06:33 Gregorius Marco confirms PipelineProcessWorker still being deferred (job_status = concurrency_limit)
06:50 Action created: Disable concurrency limiter for PipelineProcessWorker
06:53 Action completed (Furhan Shabir): Feature flag disable_sidekiq_concurrency_limit_middleware_PipelineProcessWorker enabled globally via ChatOps
07:04 Update shared: monitoring to confirm mitigation
12:46 Fixed: Confirmed mitigation effective. Incident paused 24h for monitoring (Pravar Gauba)
17:02 Steve Abrams re-escalates to IMOC
17:06 Vasilii Iakliushin takes Incident Lead
17:38 Vasilii Iakliushin: 68k-130k pipelines still in 'running' state backlog; no new stuck pipelines
17:55 Hordur Yngvason: most "running" pipelines from last 48h are actually stuck (zombie problem)
18:28 Hordur Yngvason: prior scoping methods incomplete -- external statuses also affect pipeline processing
19:06 Team determines restarting stuck pipelines after long delay is unsafe (risk of stale deployments)
19:15 Confirmed: stuck pipelines dropped to near zero in last 12h -- mitigation is working
19:19 Refined scope: ~3,500-8,000 stuck pipelines in main window
19:27 Stuck pipelines distributed broadly across many projects (not concentrated)
20:27 Stan Hu opens MR !226342: Fix race condition in concurrency limit queue draining
20:38 Incident resolved (Steve Abrams): Will not auto-process backlog; customers advised to cancel and retry
20:56 Stan Hu: concurrency_limit_eager_resume_processing played a role in losing jobs; recommends disabling until MR !226342 lands
21:50 Hordur Yngvason: concurrency_limit_eager_resume_processing disabled; concurrency limiter re-enabled for PipelineProcessWorker (safe now)

Zoom Call Summaries

2026-03-06 evening call: The team investigated the scope and impact of stuck CI pipelines, determining that while thousands of pipelines remain stuck, customer impact is low and the issue is resolved for new pipelines. Hordur Yngvason determined most "running" pipelines from the last 48h were actually stuck, and that external statuses also affect pipeline processing. The team agreed restarting stuck pipelines is unsafe, and no immediate bulk action is needed. Follow-up work planned for backlog cleanup and system hardening.

2026-03-05 evening call: The team investigated customer pipelines stuck in running state despite all jobs completing. Aaron Richter identified the deduplication/concurrency limit middleware interaction. Stan Hu identified the concurrency_limit_eager_resume_processing feature flag and exclusive lease failure as contributing factors. Root cause correlated to known bug gitlab-org/gitlab#582085.

Actions

Action | Owner | Status
Disable concurrency limiter for PipelineProcessWorker via disable_sidekiq_concurrency_limit_middleware_PipelineProcessWorker | Furhan Shabir | Completed (06:53 UTC)
Disable concurrency_limit_eager_resume_processing feature flag | (not recorded) | Completed (post-resolution)
Re-enable concurrency limiter for PipelineProcessWorker (safe after eager resume disabled) | Hordur Yngvason | Completed (~21:50 UTC)

Follow-ups / Corrective Actions

Action | Owner | Issue | Status
Fix race condition in concurrency limit queue draining (lease renewal) | Stan Hu | gitlab-org/gitlab!226342 (merged) | In review (pipeline passing)
Self-heal pipelines stuck with executing status with no executing builds | Max Fan / Hordur Yngvason | gitlab-org/gitlab#582085 | Outstanding
Fix ConcurrencyLimit::ResumeWorker bug (jobs not re-queued) | Stan Hu | gitlab-org/gitlab#580466 | Open (milestone 18.10)
Add alerting for pipelines stuck in 'running' state with no active jobs (detection took ~55h) | Hordur Yngvason / Panos Kanellidis | gitlab-org/gitlab#592819 | Created
Evaluate safety of auto-processing stuck pipelines (threshold after which processing is unsafe) | Hordur Yngvason / Max Fan | gitlab-org/gitlab#582085 (comment) | Discussion
Open FCL issue per S2 process | Hordur Yngvason / Cheryl Li | Feature Change Lock | Created

Self-Managed / Dedicated Implications

This bug affects any GitLab instance using the Sidekiq concurrency limiter with PipelineProcessWorker where the concurrency_limit_eager_resume_processing feature flag is enabled. Self-managed and Dedicated instances could be affected under similar high-load conditions.

Workaround: Disable the concurrency_limit_eager_resume_processing feature flag.

Permanent fix: MR !226342 (fix exclusive lease renewal in queue draining) needs to be merged and potentially backported.

Review Guidelines

This review should be completed by the team which owns the service causing the alert. That team has the most context around what caused the problem and what information will be needed for an effective fix. The EOC or IMOC may create this issue, but unless they are also on the service owning team, they should assign someone from that team as the DRI.

For the person opening the Incident Review

  • Set the title to Incident Review: (Incident issue name)
  • Assign a Service::* label (most likely matching the one on the incident issue)
  • Set a Severity::* label which matches the incident
  • In the Key Information section, make sure to include a link to the incident issue
  • Find and assign a DRI from the team which owns the service (check their Slack channel or assign the team's manager). The DRI for the incident review is the issue assignee.

For the assigned DRI

  • Fill in the remaining fields in the Key Information section, using the incident issue as a reference. Feel free to ask the EOC or other folks involved if anything is difficult to find.

  • If there are metrics showing Customers Affected or Requests Affected, link those metrics in those fields

  • For all S1 and S2 incidents, begin the Feature Change Lock (FCL) process and open an issue in the FCL project.

  • Create a few short sentences in the Summary section summarizing what happened (TL;DR)

  • Link any corrective actions and describe any other actions or outcomes from the incident

  • Consider the implications for self-managed and Dedicated instances. For example, do any bug fixes need to be backported?

  • Once discussion wraps up in the comments, summarize any takeaways in the details section

  • If the incident timeline does not contain any sensitive information and this review can be made public, turn off the issue's confidential mode and link this review to the incident issue.

  • Close the review before the due date

  • Go back to the incident channel or page and close out the remaining post-incident tasks

Edited by Hordur Freyr Yngvason