Reduce the number of Conflicts returned by the Commit Status API

What does this MR do and why?

Allows more time for the status update to recover if the SHA/pipeline is locked.

If an update that moves a commit status to a completed state is missed, jobs can be left hanging as pending/running, which can block users.

Recently we raised the TTL on the lock to 60 seconds, based on a few outlier requests that took that long. A shorter lock led to duplicates, which was a bug, since the application expects exactly one 'current' status of a given name per SHA. Note: we can have multiple non-current statuses, which are marked as 'retried=true'.

We also started returning 409 Conflict, which should signal the client to retry the HTTP request, unlike the 500 that was returned previously. I've documented that in this MR.

Before that change, users could only expect a 500 on a conflict, so they may not have implemented retries on the client side, especially across the many different integrations.
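For integrations that want to handle the 409 themselves, a client-side retry could look roughly like the sketch below. The helper name `post_status_with_retry`, the `max_attempts` and `base_delay` values, and the injectable `sleeper` are all hypothetical and chosen for illustration; the block passed in stands for whatever HTTP call the integration already makes and is assumed to return the response status code.

```ruby
# Hypothetical client-side helper, not part of GitLab or any client library.
# Retries the given block while it returns 409, backing off between attempts.
# max_attempts and base_delay are illustrative values, not recommendations.
def post_status_with_retry(max_attempts: 5, base_delay: 0.5, sleeper: ->(s) { sleep(s) })
  attempt = 0
  loop do
    attempt += 1
    code = yield(attempt)                  # block performs the HTTP call, returns the status code
    return code unless code == 409         # any non-conflict response is final
    return code if attempt >= max_attempts # give up and surface the last 409
    sleeper.call(base_delay * (2**(attempt - 1))) # exponential backoff: 0.5s, 1s, 2s, ...
  end
end
```

The `sleeper` argument exists only so the backoff can be exercised without real waiting; a caller would normally leave it at its default.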

Today we retry for only 2 seconds (20 retries × 0.1 seconds), based on the configured number of retries and sleep_sec:

    def pipeline_lock_params
      {
        ttl: (Feature.enabled?(:long_pipeline_lock_ttl, project) ? 1.minute : 5.seconds),
        sleep_sec: 0.1.seconds,
        retries: 20
      }
    end

We should increase sleep_sec so that the request has a better chance to recover if the pipeline is locked. Changing sleep_sec to 0.5 and keeping retries at 20 gives us 10 seconds of total retry time (20 × 0.5), enough to cover the 99th percentile (~2 seconds) with plenty of buffer. For longer requests, clients can retry on conflict.
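The retry window is just the sleep interval multiplied by the number of retries. As a quick check of the numbers, here is a minimal sketch; `retry_window` is a hypothetical helper written for this comparison, and the proposed sleep of 0.5 seconds is assumed because it is the value that yields 10 seconds with 20 retries.

```ruby
# Hypothetical helper: worst-case time spent waiting on the lock,
# assuming one sleep of sleep_sec per retry and ignoring request overhead.
def retry_window(sleep_sec:, retries:)
  sleep_sec * retries
end

current  = retry_window(sleep_sec: 0.1, retries: 20) # today's configuration, about 2 seconds
proposed = retry_window(sleep_sec: 0.5, retries: 20) # assumed new configuration, 10 seconds

puts "current: #{current.round(2)}s, proposed: #{proposed.round(2)}s"
```

Both windows stay well below the 60 second lock TTL, so a request that outlasts the window still ends in a 409 that the client is expected to retry.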

Ultimately, a better architecture would be: #575990

Long-running requests like the 60-second ones can affect reliability. This effect should be well contained, since the 99th percentile is around 2 seconds.

Metrics

https://log.gprd.gitlab.net/app/r/s/Cm5pW - We should see the number of 409s reduced (screenshot: Screenshot_2025-10-09_at_12.07.09_PM).

Edited by Allison Browne