Retry mechanism in case of exceeded quota on namespace

Giulio Tripi requested to merge giuliotripi/gitlab-runner:main into main

What does this MR do?

This MR adds a retry mechanism for the Kubernetes executor in case of an exceeded resource quota on the target namespace.

Why was this MR needed?

When the Kubernetes executor is used and the target namespace has reached its resource quota, the job fails with an error message that can confuse a developer with limited Kubernetes knowledge.

In addition, since resource saturation is usually temporary, the developer has to wait and relaunch the job after some time, hoping that the resources have been freed in the meantime.

The workaround was to set a hardcoded limit on the number of jobs the runner could take, even though some jobs have lower resource requests than others and therefore do not consume all available resources.

This MR adds a retry mechanism during pod creation. If the error returned by Kubernetes is an exceeded-quota error, the runner waits and tries to create the pod again.

This MR introduces a feature flag (FF_RETRY_ON_KUBE_EXCEEDED_QUOTA), default false, that enables this mechanism (we could instead enable it by default), and a TOML configuration option for how many times to retry before giving up (retry_times_on_kube_exceeded_quota), default 80. The interval between retries is 15s, so by default the runner waits no more than 20 minutes.
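A config.toml fragment enabling the mechanism might look like this (sketch only: the MR does not state which section the new option lives in, so its placement under `[runners.kubernetes]` is an assumption):

```toml
[[runners]]
  name = "k8s-runner"
  executor = "kubernetes"
  [runners.feature_flags]
    # enables the retry-on-exceeded-quota mechanism (default: false)
    FF_RETRY_ON_KUBE_EXCEEDED_QUOTA = true
  [runners.kubernetes]
    namespace = "ci-jobs"
    # retries before giving up; 80 retries x 15s = at most ~20 minutes of waiting
    retry_times_on_kube_exceeded_quota = 80
```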

What's the best way to test this MR?

Enable the feature flag, then try to execute two jobs with the Kubernetes executor in a namespace where resources for only one job are available. The first job will run as before, while the second job will wait until the first job terminates (provided the first job terminates in less than 20 minutes).

Below is a screenshot of an example execution:

image_2024-01-24_15-39-26.png

What are the relevant issue numbers?

Closes #28184, #29625

Edited by Giulio Tripi
