
Introduce Exponential Backoff Retries to the GitLab repo clone

GitLab has been seeing an increased rate of pipeline job failures because GitLab is unable to service certain kinds of load. While it is understandable that requests must sometimes be rejected to keep the platform stable, this is costing developer time: pipelines that should be passing now require additional human attention to proceed.

https://gitlab.com/gitlab-org/gitlab/-/jobs/4893296566
https://gitlab.com/gitlab-org/gitlab/-/jobs/4893296520

(Attached screenshots of the failing retry output: Screenshot_from_2023-08-18_12-31-56, Screenshot_from_2023-08-18_12-32-05)

As can be seen from these jobs, the git clone requests are already being retried, but whatever load is preventing GitLab from processing them does not subside in the few seconds between attempts. I would recommend implementing some kind of exponential backoff mechanism here so that these retries are spaced out over a longer duration, increasing the chance of success and avoiding a sudden spike of repeated requests to GitLab while it is already under load. A sketch of what this could look like follows.
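A minimal sketch of such a backoff loop, written in Go (the language of gitlab-runner). The function name, base delay, attempt cap, and cleanup behaviour are illustrative assumptions, not the runner's actual clone code path:

```go
package main

import (
	"fmt"
	"math/rand"
	"os"
	"os/exec"
	"time"
)

// cloneWithBackoff retries `git clone` with exponential backoff plus jitter.
// Base delay and attempt count here are hypothetical placeholder values.
func cloneWithBackoff(repoURL, dest string, maxAttempts int) error {
	base := 2 * time.Second
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		cmd := exec.Command("git", "clone", repoURL, dest)
		if err := cmd.Run(); err == nil {
			return nil
		} else if attempt == maxAttempts {
			return fmt.Errorf("clone failed after %d attempts: %w", maxAttempts, err)
		}
		// Remove any partial clone before retrying.
		os.RemoveAll(dest)
		// Exponential backoff: base * 2^(attempt-1), plus up to 50% jitter
		// so many runners that failed together do not retry in lockstep.
		delay := base * time.Duration(1<<(attempt-1))
		time.Sleep(delay + time.Duration(rand.Int63n(int64(delay/2))))
	}
	return nil
}

func main() {
	if err := cloneWithBackoff("https://gitlab.com/gitlab-org/gitlab.git", "gitlab", 5); err != nil {
		fmt.Println(err)
	}
}
```

The jitter term matters as much as the doubling: without it, every runner that failed at the same moment would retry at the same moment too, reproducing the load spike on each attempt.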

Investigation with @ddieulivol found that the retry count is configurable in our runners, but an exponential backoff mechanism between those attempts will need to be developed into the runners themselves.
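For context, assuming the knob in question is the documented `GET_SOURCES_ATTEMPTS` CI/CD variable, raising the retry count per job looks like this; note it only controls how many attempts are made, not the spacing between them, which is the gap this issue proposes to fill:

```yaml
# .gitlab-ci.yml — raise the number of source-fetch attempts for a job.
variables:
  GET_SOURCES_ATTEMPTS: "5"  # retry count only; no delay/backoff between attempts
```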
