Geo: Sync requests are rate limited by the primary
Summary
On geo.staging.gitlab.com:
file download requests began hitting a rate limit.
429 Too Many Requests responses are being logged as the failure reason for the Geo secondary's FileDownloadService attempts.
A rate limit is not out of the question, but it should be intentional, documented, and tunable for Geo purposes.
Steps to reproduce
Increase the File synchronization concurrency limit from 10 to 25 on geo.staging.gitlab.com. Sync retries for the 91k uploads, artifacts, and LFS objects that are missing on the primary will then run at a higher rate and start receiving 429s.
What is the current bug behavior?
The primary sometimes responds to the secondary's sync requests with 429s.
What is the expected correct behavior?
The primary should not respond to the secondary with 429s.
Workaround
Workarounds that I can think of are not very good:
- Reduce the concurrency limit, but that will affect non-failing syncs as well.
- Inject an HTTP header into requests from the Geo secondary site, and use that header to bypass rate limiting. (Unfortunately we haven't implemented e.g. an Application Setting to allowlist IPs.)
Possible fixes
We can add a safelist block (https://github.com/rack/rack-attack#safelistname-block) to lib/gitlab/rack_attack.rb (https://gitlab.com/gitlab-org/gitlab/-/blob/1f194e562d31712b4b066a8f37246ca2077fde67/lib/gitlab/rack_attack.rb).
The block just needs to return true for authenticated file sync requests from a Geo secondary site, or ideally any authenticated Geo secondary site request.
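A minimal sketch of what such a block could look like, not the actual GitLab implementation. It assumes Geo secondary requests carry an Authorization header with a `GL-Geo` scheme and that Geo API paths start with `/api/v4/geo/`; those two values, the `GeoSafelist` module, and the `verifier` callable are hypothetical placeholders that would need checking against the real Geo request code. The registration at the bottom only runs when rack-attack is loaded.

```ruby
# Hypothetical helper: decides whether a request should bypass rate limiting.
module GeoSafelist
  # Assumed values for illustration; verify against the real Geo code.
  AUTH_SCHEME = 'GL-Geo'
  PATH_PREFIX = '/api/v4/geo/'

  # path:        request path, e.g. req.path
  # auth_header: value of the Authorization header (may be nil)
  # verifier:    callable that validates the token. Matching the scheme
  #              alone is not authentication, so the token must be checked.
  def self.geo_request?(path, auth_header, verifier:)
    return false unless path.to_s.start_with?(PATH_PREFIX)

    scheme, token = auth_header.to_s.split(' ', 2)
    return false unless scheme == AUTH_SCHEME && token && !token.empty?

    verifier.call(token) ? true : false
  end
end

# Register the safelist only when rack-attack is available.
if defined?(Rack::Attack)
  Rack::Attack.safelist('authenticated Geo secondary requests') do |req|
    GeoSafelist.geo_request?(
      req.path,
      req.get_header('HTTP_AUTHORIZATION'),
      verifier: ->(_token) { false } # placeholder: plug in real token validation
    )
  end
end
```

Keeping the predicate separate from the `Rack::Attack.safelist` call makes it easy to unit test without booting the middleware. Narrowing `PATH_PREFIX` would restrict the bypass to file sync requests only.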
If there's an unexpected problem with identifying those types of requests, then another option is to safelist IPs in the Application Setting geo_node_allowed_ips, though we'd then have to redo the safelist whenever this setting gets updated.
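The IP-based fallback could be sketched roughly as follows. This assumes the geo_node_allowed_ips setting holds a comma-separated list of IP addresses or CIDR ranges; the `GeoIpSafelist` class and the way it gets reloaded are hypothetical and only illustrate the "redo the safelist on update" concern.

```ruby
require 'ipaddr'

# Hypothetical holder for the parsed allowlist.
class GeoIpSafelist
  def initialize(setting_value)
    reload(setting_value)
  end

  # Re-parse the setting; call this whenever the setting is updated.
  # Assumed format: comma-separated IPs or CIDR ranges.
  def reload(setting_value)
    @ranges = setting_value.to_s.split(',')
                           .map(&:strip)
                           .reject(&:empty?)
                           .map { |entry| IPAddr.new(entry) }
  end

  # True when the given client IP falls inside any allowed range.
  def allowed?(ip)
    addr = IPAddr.new(ip.to_s)
    @ranges.any? { |range| range.include?(addr) }
  rescue IPAddr::Error
    false # unparsable client address: do not safelist
  end
end

# Register with rack-attack only when it is loaded.
if defined?(Rack::Attack)
  geo_ips = GeoIpSafelist.new('10.0.0.0/8') # placeholder value, not a real setting
  Rack::Attack.safelist('Geo allowed IPs') { |req| geo_ips.allowed?(req.ip) }
end
```

Because the block closes over a single object, updating the setting only requires calling `reload` on it rather than registering a new safelist.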