feat: retry on http error
What
Change the default http.Client to
github.com/hashicorp/go-retryablehttp.Client to get automatic retries
and exponential backoff.
We retry the request 2 times, for a total of 3 attempts. The minimum retry wait is 1 second and the maximum is 15 seconds.
Hide the retry logic behind a temporary feature flag, FF_GITLAB_SHELL_RETRYABLE_HTTP, so that we can easily roll this out on GitLab.com. Once we verify that this works as expected, we will remove FF_GITLAB_SHELL_RETRYABLE_HTTP and make the retry logic the default.
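A feature-flag gate like this could be sketched as follows. This assumes the flag is read from an environment variable and that the value "1" enables it. The text above only names the flag, so the accepted value and the helper `retryableHTTPEnabled` are hypothetical.

```go
package main

import (
	"fmt"
	"os"
)

// Flag name taken from the description above.
const retryableHTTPFlag = "FF_GITLAB_SHELL_RETRYABLE_HTTP"

// retryableHTTPEnabled reports whether the flag is switched on.
// Assumption: "1" means enabled; any other value (or unset) means disabled.
// The lookup function is injected so the check can be exercised without
// touching the real environment.
func retryableHTTPEnabled(getenv func(string) string) bool {
	return getenv(retryableHTTPFlag) == "1"
}

func main() {
	if retryableHTTPEnabled(os.Getenv) {
		fmt.Println("using retryable HTTP client")
	} else {
		fmt.Println("using default HTTP client")
	}
}
```

Keeping the check in one small function means that removing the flag later is a one-place change.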
Why
In gitlab-com/gl-infra/production#7979 (closed) users
end up seeing the following errors when trying to git-clone(1) a
repository locally or in CI.
remote: ===============================
remote:
remote: ERROR: Internal API unreachable
remote:
remote: ================================
When we look at the application logs we see the following error:
{ "err": "http://gitlab-webservice-git.gitlab.svc:8181/api/v4/internal/allowed":
dial tcp 10.69.184.120:8181: connect: connection refused", "msg":
"Internal API unreachable"}
In
gitlab-com/gl-infra/production#7979 (comment 1222670120)
we've correlated these connection refused errors with infrastructure
events that remove the git pods that are hosting the
gitlab-webservice-git service. We could try to make the underlying
infrastructure more reactive to these changes as suggested in
gitlab-com/gl-infra/production#7979 (comment 1225164944)
but we can still end up serving bad requests.
Implementing retry logic for 5xx or other errors would still let users
git-clone(1) repositories, albeit more slowly.
This is especially important during CI runs so users don't have to retry
jobs themselves.
Reference: gitlab-com/gl-infra/production#7979 (closed)
Reference: #604 (closed)