Geo: Sync requests are rate limited by the primary


Summary

#294100 (comment 469640023):

On geo.staging.gitlab.com:

File download requests began hitting a rate limit. 429 Too Many Requests responses are being logged as the failure reason for the Geo secondary's FileDownloadService attempts.

A rate limit is not out of the question, but it should be intentional, documented, and tunable for Geo purposes.

Steps to reproduce

Increase the File synchronization concurrency limit from 10 to 25 on geo.staging.gitlab.com. Sync retries for the 91k uploads/artifacts/LFS objects that are missing on the primary will then run at a higher rate and start receiving 429s.

Example Project

What is the current bug behavior?

The primary sometimes responds to the secondary's sync requests with 429s.

What is the expected correct behavior?

The primary should not respond to the secondary with 429s.

Relevant logs and/or screenshots

Output of checks

Results of GitLab environment info


(For installations with omnibus-gitlab package run and paste the output of:
`sudo gitlab-rake gitlab:env:info`)

(For installations from source run and paste the output of:
`sudo -u git -H bundle exec rake gitlab:env:info RAILS_ENV=production`)

Results of GitLab application Check


(For installations with omnibus-gitlab package run and paste the output of:
`sudo gitlab-rake gitlab:check SANITIZE=true`)

(For installations from source run and paste the output of:
`sudo -u git -H bundle exec rake gitlab:check RAILS_ENV=production SANITIZE=true`)

(we will only investigate if the tests are passing)

Workaround

Workarounds that I can think of are not very good:

  • Reduce the concurrency limit, but that will slow down non-failing syncs as well.
  • Inject an HTTP header into requests from the Geo secondary site, and have the primary bypass rate limiting for requests that carry it. (Unfortunately we haven't implemented an alternative such as an Application Setting to allowlist IPs.)

Possible fixes

We can add a safelist block https://github.com/rack/rack-attack#safelistname-block to https://gitlab.com/gitlab-org/gitlab/-/blob/1f194e562d31712b4b066a8f37246ca2077fde67/lib/gitlab/rack_attack.rb.

The block just needs to return true for authenticated file sync requests from a Geo secondary site, or ideally any authenticated Geo secondary site request.
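
As a rough illustration of what such a block could check, here is a minimal sketch. It assumes Geo requests can be recognized by an Authorization header with a `GL-Geo` scheme, and it takes a hypothetical `verifier` callable that stands in for the real token validation. The method name `geo_secondary_request?` and the safelist name are made up for this example and are not taken from the GitLab codebase.

```ruby
# Sketch only: the 'GL-Geo' scheme is an assumption about how Geo requests
# are marked, and `verifier` is a hypothetical callable standing in for real
# token validation (signature, expiry, known secondary site).
def geo_secondary_request?(env, verifier, scheme: 'GL-Geo')
  header = env['HTTP_AUTHORIZATION'].to_s
  found_scheme, token = header.split(' ', 2)

  # Reject anything that is not "<scheme> <non-empty token>".
  return false unless found_scheme == scheme && token && !token.empty?

  # Only safelist when the token actually validates.
  verifier.call(token) ? true : false
end

# Wiring this into rack-attack would look roughly like:
#
#   Rack::Attack.safelist('geo secondary requests') do |req|
#     geo_secondary_request?(req.env, verifier)
#   end

# Toy usage with a verifier that accepts one hard-coded token.
toy_verifier = ->(token) { token == 'access-key:signed-payload' }
puts geo_secondary_request?({ 'HTTP_AUTHORIZATION' => 'GL-Geo access-key:signed-payload' }, toy_verifier)
puts geo_secondary_request?({ 'HTTP_AUTHORIZATION' => 'Bearer abc' }, toy_verifier)
```

The important property is that the header alone is never enough to be safelisted: an unauthenticated client that sends the right scheme with a bogus token is still rate limited.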

If there's an unexpected problem with identifying those types of requests, then another option is to safelist the IPs in the Application Setting geo_node_allowed_ips, though we'd then have to rebuild the safelist whenever this setting is updated.
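
The IP-based fallback could be sketched along the following lines. This assumes geo_node_allowed_ips holds a comma-separated list of IPs and CIDR ranges. The helper names and the `cache` hash are hypothetical. Re-parsing only when the raw setting string changes is one simple way to handle the "redo on update" concern, not necessarily how GitLab would do it.

```ruby
require 'ipaddr'

# Sketch only: assumes the setting is a comma-separated list of IPs/CIDRs.
def parse_allowed_ips(raw)
  raw.to_s.split(',').map(&:strip).reject(&:empty?).map { |entry| IPAddr.new(entry) }
end

# Returns true when `ip` falls inside one of the allowed ranges.
# `cache` is a plain Hash (hypothetical) that keeps the parsed ranges and is
# refreshed whenever the raw setting differs from the one last parsed.
def ip_safelisted?(ip, raw_setting, cache)
  if cache[:ranges].nil? || cache[:raw] != raw_setting
    cache[:ranges] = parse_allowed_ips(raw_setting)
    cache[:raw] = raw_setting
  end

  addr = IPAddr.new(ip)
  cache[:ranges].any? { |range| range.include?(addr) }
rescue IPAddr::InvalidAddressError
  # Unparseable client IP or setting entry: do not safelist.
  false
end

# Toy usage: the second call reuses the cached ranges, the third re-parses.
cache = {}
puts ip_safelisted?('10.0.1.5', '10.0.0.0/16, 192.168.1.10', cache)
puts ip_safelisted?('10.1.0.1', '10.0.0.0/16, 192.168.1.10', cache)
puts ip_safelisted?('10.1.0.1', '10.1.0.0/24', cache)
```

Compared with an authenticated-request check, this trusts the network address alone, so it is only as strong as the accuracy of the allowed IP list.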
