Retry logic when receiving 503 and 429 error codes from Azure Blob Storage
Summary
We're encountering 500 errors when downloading artifacts (or other files) from object storage. This appears to happen when Azure rate-limits requests (because usage thresholds have been hit) or when server performance is poor, and it causes artifact downloads from Azure object storage to fail.
What I think is happening: GitLab is receiving error codes from Azure Blob Storage that it isn't expecting, and it surfaces them as 500 errors. I believe we should add retry logic when certain codes are returned, such as 429 or 503.
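As a starting point, here is a minimal sketch in Go (the language of GitLab Workhorse, which proxies object-storage downloads) of how responses could be classified as retryable. The helper names are hypothetical, not existing GitLab code; Azure signals throttling with 503 (ServerBusy) and 429, and often includes a `Retry-After` header giving the wait in seconds:

```go
package objstore

import (
	"net/http"
	"strconv"
	"time"
)

// isRetryable reports whether a response status from Azure Blob Storage
// indicates a transient condition worth retrying: 429 (Too Many Requests)
// and 503 (Service Unavailable / ServerBusy) both signal throttling.
func isRetryable(status int) bool {
	return status == http.StatusTooManyRequests ||
		status == http.StatusServiceUnavailable
}

// retryDelay honours the Retry-After header when Azure supplies one
// (a delay in seconds), falling back to the caller's default otherwise.
func retryDelay(resp *http.Response, fallback time.Duration) time.Duration {
	if s := resp.Header.Get("Retry-After"); s != "" {
		if secs, err := strconv.Atoi(s); err == nil && secs >= 0 {
			return time.Duration(secs) * time.Second
		}
	}
	return fallback
}
```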
Steps to reproduce
- Set up GitLab with artifact storage hosted on [Azure Blob Storage](https://azure.microsoft.com/en-us/services/storage/blobs/), choosing the Standard tier rather than the Premium tier
- Produce some job logs by setting up and running a few simple pipelines
- Write a simple bash script that fetches the logs of a completed job in a loop (see the sketch after this list)
- Many of the calls should fail with a 500
- Check the Azure Blob Storage logs in the Azure portal and you'll see the 503 errors
WIP: Not sure how to reproduce the 429; we would have to push transaction rates and ingress/egress past Azure's account thresholds.
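A sketch of the polling script described above, written in Go rather than bash for consistency with the other examples here. The host, project ID, and job ID are placeholders, and the token is read from the environment:

```go
package main

import (
	"fmt"
	"net/http"
	"os"
)

func main() {
	// Placeholder URL: substitute your instance, project ID, and job ID.
	const url = "https://gitlab.example.com/api/v4/projects/1/jobs/42/trace"

	client := &http.Client{}
	for i := 1; i <= 200; i++ {
		req, err := http.NewRequest(http.MethodGet, url, nil)
		if err != nil {
			panic(err)
		}
		req.Header.Set("PRIVATE-TOKEN", os.Getenv("GITLAB_TOKEN"))

		resp, err := client.Do(req)
		if err != nil {
			fmt.Printf("attempt %d: request error: %v\n", i, err)
			continue
		}
		if resp.StatusCode != http.StatusOK {
			// With throttled object storage, many of these are 500s.
			fmt.Printf("attempt %d: HTTP %d\n", i, resp.StatusCode)
		}
		resp.Body.Close()
	}
}
```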
Example Project
WIP
What is the current bug behavior?
The API call fails with an Internal Server Error (500).
What is the expected correct behavior?
Retry the download from object storage a sensible number of times. This is also the advice given to us by Microsoft Support, who joined one of the customer calls about this issue. The customer has made it clear that rate limiting occurs often, so we should investigate whether GitLab can be made more resilient to these conditions.
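To make "a sensible number of times" concrete, here is a sketch of capped exponential backoff with jitter that honours `Retry-After` when Azure supplies it. The attempt count, base delay, and function shape are illustrative assumptions, not a tuned recommendation; in GitLab the retry would wrap the actual object-storage GET rather than a bare `http.Get`:

```go
package objstore

import (
	"fmt"
	"io"
	"math/rand"
	"net/http"
	"strconv"
	"time"
)

// fetchWithRetry downloads url, retrying 429/503 responses with capped
// exponential backoff plus jitter. maxAttempts and baseDelay are
// illustrative (e.g. 5 attempts, 500ms), not tuned values.
func fetchWithRetry(url string, maxAttempts int, baseDelay time.Duration) ([]byte, error) {
	delay := baseDelay
	for attempt := 1; ; attempt++ {
		resp, err := http.Get(url)
		if err != nil {
			// Network-level failure: retry until attempts run out.
			if attempt >= maxAttempts {
				return nil, fmt.Errorf("download failed after %d attempts: %w", attempt, err)
			}
		} else {
			if resp.StatusCode == http.StatusOK {
				defer resp.Body.Close()
				return io.ReadAll(resp.Body)
			}
			// Prefer Azure's own guidance when it sends Retry-After (seconds).
			wait := delay
			if s := resp.Header.Get("Retry-After"); s != "" {
				if secs, perr := strconv.Atoi(s); perr == nil && secs >= 0 {
					wait = time.Duration(secs) * time.Second
				}
			}
			retryable := resp.StatusCode == http.StatusTooManyRequests ||
				resp.StatusCode == http.StatusServiceUnavailable
			resp.Body.Close()
			if !retryable || attempt >= maxAttempts {
				return nil, fmt.Errorf("download failed after %d attempts: HTTP %d",
					attempt, resp.StatusCode)
			}
			delay = wait
		}
		// Sleep with jitter, then double the backoff for the next round.
		time.Sleep(delay + time.Duration(rand.Int63n(int64(delay)/2+1)))
		delay *= 2
	}
}
```

Preferring the server-supplied `Retry-After` over a locally computed delay matches the usual guidance for throttled storage APIs; the exponential backoff only kicks in when Azure doesn't tell us how long to wait.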