feat: retry on http error

What

Change the default http.Client to github.com/hashicorp/go-retryablehttp.Client to get automatic retries and exponential backoff.

We retry a failed request up to 2 times, for a maximum of 3 attempts. The minimum retry wait is 1 second and the maximum is 15 seconds.

Hide the retry logic behind a temporary feature flag, FF_GITLAB_SHELL_RETRYABLE_HTTP, so that we can roll it out gradually on GitLab.com. Once we verify that it works as expected, we will remove FF_GITLAB_SHELL_RETRYABLE_HTTP and make the retry logic the default.
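A rough sketch of how such a flag check could look, assuming the flag is read from the process environment and that the value "1" means enabled. Both of those are assumptions, and `retryEnabled` is a hypothetical helper, not necessarily what this change implements.

```go
package main

import (
	"fmt"
	"os"
)

// Flag name from the description above.
const retryableHTTPFlag = "FF_GITLAB_SHELL_RETRYABLE_HTTP"

// retryEnabled reports whether the flag is switched on.
// Treating "1" as "on" is an assumption made for this sketch.
// The lookup function is injected so the check is easy to test.
func retryEnabled(getenv func(string) string) bool {
	return getenv(retryableHTTPFlag) == "1"
}

func main() {
	if retryEnabled(os.Getenv) {
		fmt.Println("using retryable HTTP client")
	} else {
		fmt.Println("using plain HTTP client")
	}
}
```

Keeping the decision in one small function means that removing the flag later only touches a single call site.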

Why

In gitlab-com/gl-infra/production#7979 (closed) users see the following error when trying to git-clone(1) a repository locally or in CI:

remote: ===============================
remote:
remote: ERROR: Internal API unreachable
remote:
remote: ================================

When we look at the application logs we see the following error:

{ "err": "http://gitlab-webservice-git.gitlab.svc:8181/api/v4/internal/allowed":
dial tcp 10.69.184.120:8181: connect: connection refused", "msg":
"Internal API unreachable"}

In gitlab-com/gl-infra/production#7979 (comment 1222670120) we correlated these connection refused errors with infrastructure events that remove the git pods hosting the gitlab-webservice-git service. We could try to make the underlying infrastructure react faster to these changes, as suggested in gitlab-com/gl-infra/production#7979 (comment 1225164944), but some requests could still reach a pod that is going away and fail.

Retrying on 5xx responses and on connection errors lets users still git-clone(1) repositories, although more slowly. This is especially important during CI runs, so that users don't have to retry jobs themselves.

Reference: gitlab-com/gl-infra/production#7979 (closed)
Reference: #604 (closed)

Edited by Steve Xuereb
