Reduce number of conflicts returned by Commit Status API
What does this MR do and why?
Allows more time for the status update to recover if the sha/pipeline is locked.
Missing a commit status update to a complete status could leave jobs hanging as pending/running, which can block users.
Recently we upped the TTL on the lock to 60 seconds. This was based on a few outlier requests taking this long. A shorter lock led to duplicates, which was a bug, since the application expects one 'current' status of a given name per sha. Note: we can have multiple non-current statuses, which are marked as 'retried=true'.
We also started returning a 409 conflict, which should be a signal to retry the HTTP request, unlike the 500 which was previously used. I've documented that in this MR.
Prior to this change, users could only expect a 500 from a conflict, so they may not have implemented retries on the client end, especially across the many integrations.
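For integrations that do not yet retry, a client-side loop along these lines could treat 409 as retryable. This is only a sketch: `with_conflict_retry`, `max_attempts`, `base_delay`, and `sleeper` are hypothetical names and values chosen for illustration, not part of the GitLab API or this MR. The block is expected to perform the HTTP request and return its status code.

```ruby
# Hypothetical client-side helper (not part of this MR or any GitLab client):
# re-run the request while the server answers 409 Conflict.
# max_attempts and base_delay are illustrative values, not recommendations.
def with_conflict_retry(max_attempts: 5, base_delay: 0.5, sleeper: ->(s) { sleep(s) })
  attempt = 0
  loop do
    attempt += 1
    status = yield(attempt) # the block performs the request and returns the HTTP status
    # Stop on any non-409 response, or once the attempt budget is used up.
    return status unless status == 409 && attempt < max_attempts

    sleeper.call(base_delay * attempt) # linear backoff before the next try
  end
end

# Usage with canned responses instead of a real HTTP call:
responses = [409, 409, 201]
final_status = with_conflict_retry(sleeper: ->(_s) {}) { |n| responses[n - 1] }
```

The injectable `sleeper` is there only so the loop can be exercised without real delays; a real client would leave the default in place.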
Today we retry for only 2 seconds, based on the configured number of retries and sleep_sec:
    def pipeline_lock_params
      {
        ttl: (Feature.enabled?(:long_pipeline_lock_ttl, project) ? 1.minute : 5.seconds),
        sleep_sec: 0.1.seconds,
        retries: 20
      }
    end
We should increase the sleep_sec so that the request has more chance to recover if the pipeline is locked. Changing sleep_sec to 0.5 and keeping retries at 20 gives us 10 seconds total retry time (20 × 0.5), enough to cover the 99th percentile (~2 seconds) with plenty of buffer. For longer requests, clients can retry on conflict.
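The retry budget is simply retries × sleep_sec, so the numbers can be checked directly. The sketch below uses hypothetical helper names (`total_retry_seconds`, `sleep_for_budget`) that do not exist in the codebase, and it ignores the time spent on each lock attempt itself.

```ruby
# Hypothetical helpers, not part of this MR: arithmetic for the lock retry budget.

# Worst-case time spent sleeping between lock attempts.
def total_retry_seconds(retries:, sleep_sec:)
  retries * sleep_sec
end

# Sleep interval needed to reach a target budget with a fixed retry count.
def sleep_for_budget(budget_sec:, retries:)
  budget_sec.to_f / retries
end

current_budget = total_retry_seconds(retries: 20, sleep_sec: 0.1) # today's settings, about 2 seconds
needed_sleep   = sleep_for_budget(budget_sec: 10, retries: 20)    # 0.5 seconds for a 10 second budget

puts "current budget: #{current_budget}s, sleep needed for 10s: #{needed_sleep}s"
```

With 20 retries held constant, a 10-second budget therefore corresponds to a 0.5-second sleep between attempts.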
Ultimately, a better architecture would be: #575990
Long-running requests like the 60-second ones can affect reliability. This effect should be very contained, since the 99th percentile is around 2 seconds.
Metrics
https://log.gprd.gitlab.net/app/r/s/Cm5pW - We should see the number of 409s reduced.