GitLab Geo causes high memory consumption during replication
Summary
At least during initial replication, Geo replication workers cause very high memory usage on the Geo primary. The increase is significant enough that, in single-node setups, it can make EC2 instances sized according to the reference architecture go unresponsive. Memory consumption does seem to settle down after initial replication completes, but I am doing more testing with a client to narrow this down.
Support ticket 374748 is an example where Geo::VerificationBatchWorker and Geo::VerificationStateBackfillWorker appear to be the culprits. It's not entirely clear to me why it gets as bad as it does on EC2 (the instance has to be hard reset), but I assume the OOM killer is terminating individual Sidekiq workers and Sidekiq is then simply starting more.
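For anyone else hitting this, here is roughly how I've been checking whether the OOM killer is involved. This is only a sketch against an Omnibus install: the log path is the default one and the grep patterns are approximate, not an official diagnostic.

```shell
# Look for recent kernel OOM kills (Sidekiq worker PIDs should show up here if my theory is right)
sudo dmesg -T | grep -iE 'out of memory|oom-kill|killed process' | tail -n 20

# Cross-check against the Sidekiq runit log to see workers dying and being restarted
sudo grep -iE 'killed|shutdown|signal' /var/log/gitlab/sidekiq/current | tail -n 20
```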
Steps to reproduce
I've had this same issue occur twice, so it seems to be somewhat reproducible:
- Start with a GitLab instance under active use with a fair number of projects and users. It doesn't need to be many; most recently this happened on a 50-user GitLab install with 8 GB of memory, although that instance did have a fairly large number of CI/CD artifacts (thousands) and a few fairly large projects.
- Enable Geo and add a secondary instance.
- During initial replication, memory usage on the primary will climb quickly (see the monitoring sketch after this list). On EC2 this can get bad enough that the instance stops responding on the network.
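To capture the climb while it happens, I've been using a crude watcher like the one below on the primary. It's just a sketch, not part of any supported procedure; the output path and the 60-second interval are arbitrary choices.

```shell
# Log overall memory plus the largest Sidekiq processes once a minute
while true; do
  {
    date
    free -m
    ps aux --sort=-rss | grep -i '[s]idekiq' | head -n 5
    echo
  } >> /tmp/geo-memory-watch.log
  sleep 60
done
```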
Example Project
Only applicable to self-hosted environments, but I may start building a test environment to see if I can replicate this a third time.
What is the current bug behavior?
Memory consumption during initial replication exceeds what the reference architecture calls for, and it seems to stay elevated even after initial replication completes (I'm working to collect more information on this).
What is the expected correct behavior?
Although the client I'm working with is content to simply run a larger EC2 instance, a more complete fix would be some kind of safeguard against too many replication or verification jobs running simultaneously, or throttling along those lines.
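For the record, the kind of knob I have in mind is something like capping Sidekiq concurrency on the primary via /etc/gitlab/gitlab.rb. I haven't validated this against Geo, I believe sidekiq['max_concurrency'] is still the relevant setting on 15.8 but please double-check the docs, and the value below is only an example, so treat this as a sketch rather than a recommendation.

```shell
# Untested mitigation sketch: cap Sidekiq worker threads on the primary,
# then reconfigure. The value 10 is arbitrary, not a tuned recommendation.
sudo tee -a /etc/gitlab/gitlab.rb > /dev/null <<'EOF'
sidekiq['max_concurrency'] = 10
EOF
sudo gitlab-ctl reconfigure
```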
Relevant logs and/or screenshots
GitLab SOS perf counts from an affected primary node:
WORKER COUNT RPS P99_ms P95_ms MEDIAN_ms MAX_ms MIN_ms SCORE %FAIL
Geo::VerificationBatchWorker 4774 1.41 529.7 344.5 111.0 2007.9 5.7 2529016.5 0.00
Geo::VerificationStateBackfillWorker 456 0.13 5440.8 5279.0 5048.3 5544.9 5002.7 2481001.6 0.00
ReactiveCachingWorker 604 0.18 1244.4 738.8 159.9 4470.7 6.5 751607.3 0.00
PipelineProcessWorker 311 0.09 1446.0 618.4 60.4 1822.9 6.2 449693.3 0.00
Geo::ReverificationBatchWorker 680 0.20 468.8 273.4 70.9 1082.3 3.4 318750.1 0.00
Geo::VerificationTimeoutWorker 679 0.20 466.2 244.0 52.2 887.5 3.7 316534.6 0.00
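If it helps anyone reproduce these numbers without gitlabsos, roughly equivalent per-worker counts can be pulled from the structured Sidekiq log. This assumes the default Omnibus log path and JSON log format, and the field names are from memory, so adjust if they differ.

```shell
# Count completed jobs per worker class from the structured Sidekiq log
jq -r 'select(.job_status == "done") | .class' /var/log/gitlab/sidekiq/current \
  | sort | uniq -c | sort -rn | head -n 10
```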
The following screenshot shows memory saturation in Grafana; the large step indicated by the arrow is where the instance was resized from 8 GB to 16 GB of RAM.
Output of checks
The GitLab check and Geo check both pass normally.
Results of GitLab environment info
# gitlab-rake gitlab:env:info

System information
System:
Proxy:          no
Current User:   git
Using RVM:      no
Ruby Version:   2.7.7p221
Gem Version:    3.1.6
Bundler Version: 2.3.15
Rake Version:   13.0.6
Redis Version:  6.2.8
Sidekiq Version: 6.5.7
Go Version:     unknown

GitLab information
Version:        15.8.3-ee
Revision:       b6226e16592
Directory:      /opt/gitlab/embedded/service/gitlab-rails
DB Adapter:     PostgreSQL
DB Version:     13.8
URL:            https://[redacted]
HTTP Clone URL: https://[redacted]/some-group/some-project.git
SSH Clone URL:  git@[redacted]:some-group/some-project.git
Elasticsearch:  no
Geo:            yes
Geo node:       Primary
Using LDAP:     no
Using Omniauth: yes
Omniauth Providers: azure_activedirectory_v2

GitLab Shell
Version:        14.15.0
Repository storages:
- default:      unix:/var/opt/gitlab/gitaly/gitaly.socket
GitLab Shell path: /opt/gitlab/embedded/service/gitlab-shell
Results of GitLab application check
# gitlab-rake gitlab:check SANITIZE=true
Checking GitLab subtasks ...
Checking GitLab Shell ...
GitLab Shell: ... GitLab Shell version >= 14.15.0 ? ... OK (14.15.0)
Running /opt/gitlab/embedded/service/gitlab-shell/bin/check
Internal API available: OK
Redis available via internal API: OK
gitlab-shell self-check successful
Checking GitLab Shell ... Finished
Checking Gitaly ...
Gitaly: ... default ... OK
Checking Gitaly ... Finished
Checking Sidekiq ...
Sidekiq: ... Running? ... yes
Number of Sidekiq processes (cluster/worker) ... 1/1
Checking Sidekiq ... Finished
Checking Incoming Email ...
Incoming Email: ... Reply by email is disabled in config/gitlab.yml
Checking Incoming Email ... Finished
Checking LDAP ...
LDAP: ... LDAP is disabled in config/gitlab.yml
Checking LDAP ... Finished
Checking GitLab App ...
Database config exists? ... yes
All migrations up? ... yes
Database contains orphaned GroupMembers? ... no
GitLab config exists? ... yes
GitLab config up to date? ... yes
Cable config exists? ... yes
Resque config exists? ... yes
Log directory writable? ... yes
Tmp directory writable? ... yes
Uploads directory exists? ... yes
Uploads directory has correct permissions? ... yes
Uploads directory tmp has correct permissions? ... skipped (no tmp uploads folder yet)
Systemd unit files or init script exist? ... skipped (omnibus-gitlab has neither init script nor systemd units)
Systemd unit files or init script up-to-date? ... skipped (omnibus-gitlab has neither init script nor systemd units)
Projects have namespace: ... [snipped, all yes]
Redis version >= 6.0.0? ... yes
Ruby version >= 2.7.2 ? ... yes (2.7.7)
Git user has default SSH configuration? ... yes
Active users: ... 81
Is authorized keys file accessible? ... skipped (authorized keys not enabled)
GitLab configured to store new projects in hashed storage? ... yes
All projects are in hashed storage? ... yes
Elasticsearch version 7.x-8.x or OpenSearch version 1.x ... skipped (Advanced Search is disabled)
All migrations must be finished before doing a major upgrade ... skipped (Advanced Search is disabled)
Checking GitLab App ... Finished
Checking Geo ...
GitLab Geo is available ...
GitLab Geo is enabled ... yes
This machine's Geo node name matches a database record ... yes, found a primary node named "us-east-2"
HTTP/HTTPS repository cloning is enabled ... yes
Machine clock is synchronized ... warning
  Reason: Connection to the NTP Server pool.ntp.org took more than 60 seconds (Timeout)
  Try fixing it: Check whether you have a connectivity problem or if there is a firewall blocking it
  If this is an offline environment, you can ignore this error, but make sure you have a way to keep clocks synced
Git user has default SSH configuration? ... yes
OpenSSH configured to use AuthorizedKeysCommand ... skipped
  Reason: Cannot access OpenSSH configuration file
  Try fixing it: This is expected if you are using SELinux. You may want to check configuration manually
  For more information see: doc/administration/operations/fast_ssh_key_lookup.md
GitLab configured to disable writing to authorized_keys file ... yes
GitLab configured to store new projects in hashed storage? ... yes
All projects are in hashed storage? ... yes
Checking Geo ... Finished
Checking GitLab subtasks ... Finished
