
Enhance webhook recursion protection for CI events to prevent resource exhaustion

Customer Request

Customer: Enterprise Account (anonymized)

Impact: Production outage due to webhook recursion leading to resource exhaustion

Problem Statement

An Enterprise customer experienced a production outage caused by webhooks firing in a cycle. The root cause was identified as webhook recursion triggered by a combination of pipeline and job events, which led to:

  • Complete resource exhaustion in their infrastructure
  • Sidekiq being flooded with jobs
  • Production system becoming unavailable

Current State

We have existing recursion protection implemented by groupimport (#329743 (closed)); however, this protection appears insufficient for CI-specific webhook scenarios.

We also have rate limiting for webhooks, which provides some protection against abuse, but it was not sufficient to prevent the customer impact in this case.

Requested Enhancement

The customer is requesting clarification and potential improvements to our webhook recursion protection, specifically:

  1. Scope Assessment: Clarify the current scope of recursion protection and whether it adequately covers CI webhook scenarios
  2. CI-Specific Protection: Evaluate if CI webhooks (pipeline and job events) need specialized recursion detection and prevention
  3. Resource Protection: Implement additional safeguards to prevent webhook recursion from causing complete resource exhaustion
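
To make the scope question concrete, one common way to detect webhook recursion is to carry the ancestry of hooks that led to an event and to refuse delivery when the same hook reappears in that ancestry or the chain grows too long. The sketch below is a hypothetical Python illustration of that idea, not the existing protection from #329743. The names `WebhookChain`, `should_deliver`, and `MAX_CHAIN_DEPTH`, as well as the limit value, are assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Tuple

# Hypothetical limit on how many hooks may trigger one another in a row.
MAX_CHAIN_DEPTH = 5


@dataclass(frozen=True)
class WebhookChain:
    """Ancestry of hook IDs that led to the current event (hypothetical type)."""

    hook_ids: Tuple[int, ...] = ()

    def extend(self, hook_id: int) -> "WebhookChain":
        # Return a new chain with the fired hook appended; the original is unchanged.
        return WebhookChain(self.hook_ids + (hook_id,))


def should_deliver(chain: WebhookChain, hook_id: int) -> bool:
    """Decide whether a hook may fire for an event that carries `chain`."""
    # Direct or indirect cycle: this hook already fired earlier in the chain.
    if hook_id in chain.hook_ids:
        return False
    # Chain of distinct hooks that has grown past the allowed depth.
    if len(chain.hook_ids) >= MAX_CHAIN_DEPTH:
        return False
    return True
```

In this sketch, a pipeline event and a job event that stem from the same originating request would need to share one chain for the cycle to be seen. Whether CI events propagate that ancestry is exactly the gap the scope assessment would need to confirm.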

Business Impact

  • Customer Impact: Production outage for Enterprise customer
  • Reliability Risk: Similar issues could affect other Enterprise customers with CI-heavy workflows
  • Resource Protection: Need to prevent webhook recursion from overwhelming customer infrastructure

Acceptance Criteria

  • Audit current webhook recursion protection to identify gaps in CI event handling
  • Determine if existing rate-limiting is sufficient or if additional CI-specific protections are needed
  • Implement enhanced recursion detection for pipeline and job webhook events
  • Add safeguards to prevent resource exhaustion scenarios
  • Document the scope and limitations of webhook recursion protection
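
As one possible shape for the resource-exhaustion safeguard, a per-hook sliding-window guard can cap how many deliveries a single hook may enqueue in a given period, independent of whether recursion was detected. The following Python sketch is illustrative only. `HookFloodGuard`, its parameters, and the thresholds used in the usage note are hypothetical and are not taken from the existing rate-limiting implementation.

```python
from collections import deque
from typing import Deque, Dict


class HookFloodGuard:
    """Sliding-window cap on deliveries per hook (hypothetical safeguard)."""

    def __init__(self, max_events: int, window_seconds: float) -> None:
        self.max_events = max_events
        self.window_seconds = window_seconds
        # Timestamps of recently allowed deliveries, keyed by hook ID.
        self._sent: Dict[int, Deque[float]] = {}

    def allow(self, hook_id: int, now: float) -> bool:
        """Return True if the hook may enqueue another delivery at time `now`."""
        sent = self._sent.setdefault(hook_id, deque())
        # Discard timestamps that have fallen out of the window.
        while sent and now - sent[0] >= self.window_seconds:
            sent.popleft()
        if len(sent) >= self.max_events:
            # Over the cap: drop the delivery instead of enqueueing more work.
            return False
        sent.append(now)
        return True
```

For example, `HookFloodGuard(max_events=100, window_seconds=60.0)` would stop a single hook from enqueueing more than 100 deliveries per minute. The clock is passed in explicitly so the behavior is easy to test.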

Additional Context

This appears to be an edge-case combination of pipeline and job events triggering recursive webhooks, but given the severe customer impact, we should ensure our protection mechanisms are robust enough to handle these scenarios.

The customer's infrastructure does not scale infinitely, which makes it vulnerable to resource exhaustion from webhook floods. This is likely a common situation for Enterprise customers.

Related Issue: #329743 (closed) (existing recursion protection)
