Geo: Adding a secondary node may slow down the primary node

Problem to solve

Depending on usage and infrastructure, a GitLab instance can become slow from time to time.

Adding a Geo secondary puts further load on the original instance. Even if streaming DB replication is avoided, the secondary begins backfilling, which adds a sustained load bounded by the concurrency settings.

For small instances, the default concurrency settings can easily be too high. I believe at least one customer adding Geo hit this problem, which made their original instance unusable for a time.

Intended users

Further details

Proposal

  1. Set the default concurrency settings to the lowest common denominator: values appropriate for the 1k reference architecture.
  2. Add a final step to the Geo setup docs: "Tune concurrency settings". Link to Tuning Geo.
  3. Open a follow-up to improve and expand the details in Tuning Geo.

Implementation guide

Change these defaults:

t.integer "files_max_capacity", default: 10, null: false
t.integer "repos_max_capacity", default: 25, null: false
t.integer "verification_max_capacity", default: 100, null: false
t.integer "container_repositories_max_capacity", default: 10, null: false
t.integer "minimum_reverification_interval", default: 7, null: false

to (files_max_capacity stays at 10):

t.integer "files_max_capacity", default: 10, null: false
t.integer "repos_max_capacity", default: 10, null: false
t.integer "verification_max_capacity", default: 10, null: false
t.integer "container_repositories_max_capacity", default: 2, null: false
t.integer "minimum_reverification_interval", default: 90, null: false
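To make the size of each change explicit, the two lists can be compared programmatically. The following is a minimal Ruby sketch, not part of the proposed change itself: the constant names and the `changed_defaults` helper are hypothetical, and only the values are taken from the lists above.

```ruby
# Default values copied from the two lists above.
CURRENT_DEFAULTS = {
  files_max_capacity: 10,
  repos_max_capacity: 25,
  verification_max_capacity: 100,
  container_repositories_max_capacity: 10,
  minimum_reverification_interval: 7
}.freeze

PROPOSED_DEFAULTS = {
  files_max_capacity: 10,
  repos_max_capacity: 10,
  verification_max_capacity: 10,
  container_repositories_max_capacity: 2,
  minimum_reverification_interval: 90
}.freeze

# Hypothetical helper: returns only the settings whose default changes,
# each mapped to a { from:, to: } hash.
def changed_defaults(current, proposed)
  proposed.each_with_object({}) do |(name, new_value), changes|
    old_value = current.fetch(name)
    changes[name] = { from: old_value, to: new_value } unless old_value == new_value
  end
end

changed_defaults(CURRENT_DEFAULTS, PROPOSED_DEFAULTS).each do |name, change|
  puts "#{name}: #{change[:from]} -> #{change[:to]}"
end
```

Four of the five defaults change. The three capacity settings that change are lowered, while minimum_reverification_interval is raised. In a Rails migration, each entry would likely map to one `change_column_default` call with `from:` and `to:`, which keeps the migration reversible. The table name is not stated in this issue, so it is left out here.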

Then add something like the following to https://docs.gitlab.com/ee/administration/geo/replication/tuning.html:

Since GitLab 17.X (whatever version the above gets released in), Geo's performance settings default to values that are low for most environments, in order to avoid excessive load when setting up new Geo sites. In most cases, you are expected to increase these settings. You can do this safely as follows:

  1. Watch progress bar changes on Admin > Geo > Sites
  2. Decide which data types are progressing too slowly
  3. Watch load metrics of the primary and secondary sites
  4. Increase concurrency limits by 10 at a time, to be conservative
  5. Watch changes in progress and load metrics for at least 3 minutes
  6. Repeat until either load metrics reach your desired maximum, or syncing and verification are progressing as quickly as desired.
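The decision made on each pass through the steps above can be sketched in Ruby. This is only an illustration of the stopping rule, not a GitLab API: `next_concurrency_limit`, its parameters, and the load values are hypothetical, and the load metric is whatever you already monitor on the primary and secondary sites.

```ruby
# Step size from the procedure above: raise limits by 10 at a time.
STEP = 10

# Hypothetical helper for one iteration of the tuning loop.
# Returns the limit to use next. The limit is left unchanged when
# syncing/verification is already fast enough, or when the observed
# load has reached the maximum the operator is willing to accept.
def next_concurrency_limit(current_limit:, load:, max_load:, progressing_fast_enough:)
  return current_limit if progressing_fast_enough
  return current_limit if load >= max_load

  current_limit + STEP
end

# Example pass: progress is too slow and there is load headroom,
# so the limit is raised by one step.
limit = next_concurrency_limit(
  current_limit: 10,
  load: 0.4,      # assumed: fraction of capacity in use
  max_load: 0.8,  # assumed: operator's chosen ceiling
  progressing_fast_enough: false
)
puts limit # prints 20
```

The caller is expected to wait at least 3 minutes between passes, as in step 5, so that the load and progress readings reflect the previous increase.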

Permissions and Security

Documentation

Testing

What does success look like, and how can we measure that?

What is the type of buyer?

Links / references

Edited by Michael Kozono