GitLab Geo causes high memory consumption during replication
Summary
At least during initial replication, Geo replication workers cause very high memory usage on the Geo primary. The increase is significant enough that, in single-node setups, it can make EC2 instances sized according to the reference architecture go unresponsive. Memory consumption does seem to settle down after initial replication completes, but I am doing more testing with a client to narrow this down.
Support ticket 374748 is an example where Geo::VerificationBatchWorker and Geo::VerificationStateBackfillWorker appear to be the culprits. It's not entirely clear to me why it gets as bad as it does on EC2 (the instance has to be hard reset), but I assume the OOM killer is terminating individual Sidekiq workers and Sidekiq is then simply starting more.
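For anyone else hitting this, here is roughly how I've been checking whether the OOM killer is involved. This is only a sketch against an Omnibus install: the log path is the default one and the grep patterns are approximate, not an official diagnostic.

```shell
# Look for recent kernel OOM kills (Sidekiq worker PIDs should show up here if my theory is right)
sudo dmesg -T | grep -iE 'out of memory|oom-kill|killed process' | tail -n 20

# Cross-check against the Sidekiq runit log to see workers dying and being restarted
sudo grep -iE 'killed|shutdown|signal' /var/log/gitlab/sidekiq/current | tail -n 20
```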
Steps to reproduce
I've had this same issue occur twice, so it seems to be somewhat reproducible:
- Start with a GitLab instance under active use with a fair number of projects and users. It doesn't need to be many; most recently this happened on a 50-user GitLab install with 8 GB of memory, although that instance did have a fairly large number of CI/CD artifacts (thousands) and a few fairly large projects.
- Enable Geo and add a secondary instance.
- During initial replication, memory usage on the primary will climb quickly (see the monitoring sketch after this list). On EC2 this can get bad enough that the instance stops responding on the network.
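To capture the climb while it happens, I've been using a crude watcher like the one below on the primary. It's just a sketch, not part of any supported procedure; the output path and the 60-second interval are arbitrary choices.

```shell
# Log overall memory plus the largest Sidekiq processes once a minute
while true; do
  {
    date
    free -m
    ps aux --sort=-rss | grep -i '[s]idekiq' | head -n 5
    echo
  } >> /tmp/geo-memory-watch.log
  sleep 60
done
```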
Example Project
Only applicable to self-hosted environments, but I may start building a test environment to see if I can replicate this a third time.
What is the current bug behavior?
Memory consumption during initial replication exceeds what the reference architecture calls for, and it seems to stay elevated even after initial replication completes (I'm working to collect more information on this).
What is the expected correct behavior?
Although the client I'm working with is content to simply run a larger EC2 instance, a more complete fix would be some kind of safeguard against too many replication or verification jobs running simultaneously, or throttling along those lines.
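For the record, the kind of knob I have in mind is something like capping Sidekiq concurrency on the primary via /etc/gitlab/gitlab.rb. I haven't validated this against Geo, I believe sidekiq['max_concurrency'] is still the relevant setting on 15.8 but please double-check the docs, and the value below is only an example, so treat this as a sketch rather than a recommendation.

```shell
# Untested mitigation sketch: cap Sidekiq worker threads on the primary,
# then reconfigure. The value 10 is arbitrary, not a tuned recommendation.
sudo tee -a /etc/gitlab/gitlab.rb > /dev/null <<'EOF'
sidekiq['max_concurrency'] = 10
EOF
sudo gitlab-ctl reconfigure
```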
Relevant logs and/or screenshots
GitLab SOS perf counts from an affected primary node:
WORKER COUNT RPS P99_ms P95_ms MEDIAN_ms MAX_ms MIN_ms SCORE %FAIL
Geo::VerificationBatchWorker 4774 1.41 529.7 344.5 111.0 2007.9 5.7 2529016.5 0.00
Geo::VerificationStateBackfillWorker 456 0.13 5440.8 5279.0 5048.3 5544.9 5002.7 2481001.6 0.00
ReactiveCachingWorker 604 0.18 1244.4 738.8 159.9 4470.7 6.5 751607.3 0.00
PipelineProcessWorker 311 0.09 1446.0 618.4 60.4 1822.9 6.2 449693.3 0.00
Geo::ReverificationBatchWorker 680 0.20 468.8 273.4 70.9 1082.3 3.4 318750.1 0.00
Geo::VerificationTimeoutWorker 679 0.20 466.2 244.0 52.2 887.5 3.7 316534.6 0.00
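If it helps anyone reproduce these numbers without gitlabsos, roughly equivalent per-worker counts can be pulled from the structured Sidekiq log. This assumes the default Omnibus log path and JSON log format, and the field names are from memory, so adjust if they differ.

```shell
# Count completed jobs per worker class from the structured Sidekiq log
jq -r 'select(.job_status == "done") | .class' /var/log/gitlab/sidekiq/current \
  | sort | uniq -c | sort -rn | head -n 10
```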
The following screenshot shows memory saturation in Grafana; the large step indicated by the arrow is where the instance was resized from 8 GB to 16 GB of RAM.
Output of checks
The GitLab check and Geo check both pass normally.
Results of GitLab environment info
# gitlab-rake gitlab:env:info

System information
System:
Proxy:          no
Current User:   git
Using RVM:      no
Ruby Version:   2.7.7p221
Gem Version:    3.1.6
Bundler Version: 2.3.15
Rake Version:   13.0.6
Redis Version:  6.2.8
Sidekiq Version: 6.5.7
Go Version:     unknown

GitLab information
Version:        15.8.3-ee
Revision:       b6226e16592
Directory:      /opt/gitlab/embedded/service/gitlab-rails
DB Adapter:     PostgreSQL
DB Version:     13.8
URL:            https://[redacted]
HTTP Clone URL: https://[redacted]/some-group/some-project.git
SSH Clone URL:  git@[redacted]:some-group/some-project.git
Elasticsearch:  no
Geo:            yes
Geo node:       Primary
Using LDAP:     no
Using Omniauth: yes
Omniauth Providers: azure_activedirectory_v2

GitLab Shell
Version:        14.15.0
Repository storages:
- default:      unix:/var/opt/gitlab/gitaly/gitaly.socket
GitLab Shell path: /opt/gitlab/embedded/service/gitlab-shell
Results of GitLab application check
# gitlab-rake gitlab:check SANITIZE=true
Checking GitLab subtasks ...
Checking GitLab Shell ...
GitLab Shell: ... GitLab Shell version >= 14.15.0 ? ... OK (14.15.0)
Running /opt/gitlab/embedded/service/gitlab-shell/bin/check
Internal API available: OK
Redis available via internal API: OK
gitlab-shell self-check successful
Checking GitLab Shell ... Finished
Checking Gitaly ...
Gitaly: ... default ... OK
Checking Gitaly ... Finished
Checking Sidekiq ...
Sidekiq: ... Running? ... yes
Number of Sidekiq processes (cluster/worker) ... 1/1
Checking Sidekiq ... Finished
Checking Incoming Email ...
Incoming Email: ... Reply by email is disabled in config/gitlab.yml
Checking Incoming Email ... Finished
Checking LDAP ...
LDAP: ... LDAP is disabled in config/gitlab.yml
Checking LDAP ... Finished
Checking GitLab App ...
Database config exists? ... yes
All migrations up? ... yes
Database contains orphaned GroupMembers? ... no
GitLab config exists? ... yes
GitLab config up to date? ... yes
Cable config exists? ... yes
Resque config exists? ... yes
Log directory writable? ... yes
Tmp directory writable? ... yes
Uploads directory exists? ... yes
Uploads directory has correct permissions? ... yes
Uploads directory tmp has correct permissions? ... skipped (no tmp uploads folder yet)
Systemd unit files or init script exist? ... skipped (omnibus-gitlab has neither init script nor systemd units)
Systemd unit files or init script up-to-date? ... skipped (omnibus-gitlab has neither init script nor systemd units)
Projects have namespace: ... [snipped, all yes]
Redis version >= 6.0.0? ... yes
Ruby version >= 2.7.2 ? ... yes (2.7.7)
Git user has default SSH configuration? ... yes
Active users: ... 81
Is authorized keys file accessible? ... skipped (authorized keys not enabled)
GitLab configured to store new projects in hashed storage? ... yes
All projects are in hashed storage? ... yes
Elasticsearch version 7.x-8.x or OpenSearch version 1.x ... skipped (Advanced Search is disabled)
All migrations must be finished before doing a major upgrade ... skipped (Advanced Search is disabled)
Checking GitLab App ... Finished
Checking Geo ...
GitLab Geo is available ...
GitLab Geo is enabled ... yes
This machine's Geo node name matches a database record ... yes, found a primary node named "us-east-2"
HTTP/HTTPS repository cloning is enabled ... yes
Machine clock is synchronized ... warning
  Reason: Connection to the NTP Server pool.ntp.org took more than 60 seconds (Timeout)
  Try fixing it: Check whether you have a connectivity problem or if there is a firewall blocking it
  If this is an offline environment, you can ignore this error, but make sure you have a way to keep clocks synced
Git user has default SSH configuration? ... yes
OpenSSH configured to use AuthorizedKeysCommand ... skipped
  Reason: Cannot access OpenSSH configuration file
  Try fixing it: This is expected if you are using SELinux. You may want to check configuration manually
  For more information see: doc/administration/operations/fast_ssh_key_lookup.md
GitLab configured to disable writing to authorized_keys file ... yes
GitLab configured to store new projects in hashed storage? ... yes
All projects are in hashed storage? ... yes
Checking Geo ... Finished
Checking GitLab subtasks ... Finished
