Skip to content

Gitaly spikes to 100% CPU usage when a moved project's branch details are called via API

Summary

This will affect all environments when customers move workload in different shards. We did find this out while testing out our new 50k Reference Architecture I noticed that one of the 4 Gitaly nodes was spiking to 100% when performance testing was done against the Repository Branch Details API.

image

http://50k.testbed.gitlab.net/-/grafana/dashboard/snapshot/KDpnPrAq5Cj3sH75cGenTJE400E3G0X8

In this setup we have 4 identical Gitaly nodes (64 CPU, 240GB Memory) that each have a copy of the same large project on each. Our test tool then tests the environment at a rate of 1000 RPS and makes calls equally to each of the 4 project copies (so approximately 250 RPS to each Gitaly node). The other copies of the prohect only should around 20% usage.

The only difference that applied here is how the copy of the project was imported to the affected Gitaly node - It was specifically moved from another Gitaly node via the Project Edit API where as all the others were imported to their Gitaly nodes directly via UI settings.

To verify this I imported another copy of the project today in the same way to another Gitaly node and was able to reproduce the issue successfully.

Of note the affected node shows numerous Git operations (details in the logs section below). Interestingly not all Linux task managers showed these Git operations but top did and they were averaging around 20% CPU usage.

Steps to reproduce

  1. Import a Project to an environment that has at least 2 Gitaly nodes (API was used in this case and the specific project is a copy of gitlab-foss).
  2. Move the project to the other Gitaly node via the Project Edit API.
  3. Set up a way to monitor the CPU use on the Gitaly node that now contains the project (e.g. Prometheus or even just watching top on the node directly)
  4. Prepare the GitLab Performance Tool and the proceed to run the api_v4_projects_repository_branches_branch test against the environment. The RPS rate will be dependent on your environment but 20s_20rps should do, e.g.
    • ACCESS_TOKEN=<token> ./run-k6 -e environments/<environment>.json -s scenarios/20s_20rps.json -t tests/api_v4_projects_repository_branches_branch.js
  5. As the test is running note that the CPU usage spikes massively to 100%.

Example Project

We use a copy of gitlab-foss. It's specifically a backup of our own backup on GitHub, gitlabhq, as that's sanitised and doesn't contain private data.

What is the current bug behavior?

Gitaly CPU usage spikes massively to 100% when the Repository Branch Details API endpoint is polled for a project that's been moved to the Gitaly node via the Project Edit API.

What is the expected correct behavior?

That the Gitaly node shows the same CPU usage as others where the project is the same but hasn't been moved via the Project Edit API.

Relevant logs and/or screenshots

Perf Output, taken over a couple of minutes as a long version of the test was running:

Samples: 59M of event 'cpu-clock', Event count (approx.): 14942880250000
Overhead  Command  Shared Object           Symbol
  27.56%  git      libz.so.1.2.11          [.] inflate
  21.79%  git      libz.so.1.2.11          [.] inflate_table
   9.28%  git      libz.so.1.2.11          [.] inflate_fast
   3.59%  git      libc-2.27.so            [.] 0x000000000018adb8
   3.53%  git      git                     [.] lookup_object
   2.27%  git      libz.so.1.2.11          [.] adler32_z
   1.69%  git      git                     [.] clear_commit_marks_1
   1.61%  git      git                     [.] create_object
   1.49%  git      git                     [.] hashmap_get
   1.40%  git      git                     [.] parse_commit_buffer
   1.16%  git      git                     [.] nth_packed_object_offset
   1.13%  git      [kernel.kallsyms]       [k] __do_page_fault
   1.09%  git      [kernel.kallsyms]       [k] clear_page_erms
   0.91%  git      git                     [.] get_sha1_hex
   0.61%  git      git                     [.] bsearch_hash
   0.50%  git      git                     [.] compare_commits_by_gen_then_commit_date
   0.46%  git      libc-2.27.so            [.] 0x00000000000950dd
   0.43%  git      [kernel.kallsyms]       [k] _raw_spin_unlock_irqrestore
   0.42%  git      git                     [.] unpack_object_header_buffer
   0.40%  git      [kernel.kallsyms]       [k] unmap_page_range
   0.39%  git      [kernel.kallsyms]       [k] get_page_from_freelist
   0.37%  git      libc-2.27.so            [.] malloc
   0.37%  git      [kernel.kallsyms]       [k] filemap_map_pages
   0.36%  git      git                     [.] prio_queue_get
   0.36%  git      [kernel.kallsyms]       [k] try_charge
   0.35%  git      git                     [.] paint_down_to_common
   0.35%  git      [kernel.kallsyms]       [k] __d_lookup_rcu
   0.34%  git      git                     [.] prio_queue_put
   0.29%  git      [kernel.kallsyms]       [k] do_syscall_64

Go Tool pprof output. Top 30 Cumulative:

Showing nodes accounting for 4.24s, 23.30% of 18.20s total
Dropped 581 nodes (cum <= 0.09s)
Showing top 30 nodes out of 272
      flat  flat%   sum%        cum   cum%
         0     0%     0%      7.48s 41.10%  google.golang.org/grpc.(*Server).serveStreams.func1.1
         0     0%     0%      7.47s 41.04%  google.golang.org/grpc.(*Server).handleStream
     0.01s 0.055% 0.055%      7.47s 41.04%  google.golang.org/grpc.(*Server).processUnaryRPC
         0     0% 0.055%      6.93s 38.08%  github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1
         0     0% 0.055%      6.92s 38.02%  github.com/grpc-ecosystem/go-grpc-middleware/tags.UnaryServerInterceptor.func1
     0.01s 0.055%  0.11%      6.84s 37.58%  github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1
         0     0%  0.11%      6.84s 37.58%  gitlab.com/gitlab-org/labkit/correlation/grpc.UnaryServerCorrelationInterceptor.func1
         0     0%  0.11%      6.82s 37.47%  gitlab.com/gitlab-org/gitaly/internal/middleware/metadatahandler.UnaryInterceptor
         0     0%  0.11%      6.75s 37.09%  github.com/grpc-ecosystem/go-grpc-prometheus.(*ServerMetrics).UnaryServerInterceptor.func1
     0.01s 0.055%  0.16%      6.64s 36.48%  github.com/grpc-ecosystem/go-grpc-middleware/logging/logrus.UnaryServerInterceptor.func1
         0     0%  0.16%      5.75s 31.59%  github.com/grpc-ecosystem/go-grpc-middleware/auth.UnaryServerInterceptor.func1
         0     0%  0.16%      5.75s 31.59%  gitlab.com/gitlab-org/gitaly/internal/middleware/cancelhandler.Unary
         0     0%  0.16%      5.75s 31.59%  gitlab.com/gitlab-org/gitaly/internal/middleware/limithandler.(*LimiterMiddleware).UnaryInterceptor.func1
         0     0%  0.16%      5.75s 31.59%  gitlab.com/gitlab-org/gitaly/internal/middleware/sentryhandler.UnaryLogHandler
         0     0%  0.16%      5.64s 30.99%  github.com/grpc-ecosystem/go-grpc-middleware/tracing/opentracing.UnaryServerInterceptor.func1
     0.02s  0.11%  0.27%      5.61s 30.82%  gitlab.com/gitlab-org/gitaly/internal/middleware/cache.UnaryInvalidator.func1
     0.01s 0.055%  0.33%      5.58s 30.66%  gitlab.com/gitlab-org/gitaly/internal/middleware/panichandler.UnaryPanicHandler
         0     0%  0.33%      4.06s 22.31%  gitlab.com/gitlab-org/gitaly/proto/go/gitalypb._RefService_FindBranch_Handler
         0     0%  0.33%      3.87s 21.26%  runtime.mcall
     3.84s 21.10% 21.43%      3.84s 21.10%  runtime.futex
     0.04s  0.22% 21.65%      3.76s 20.66%  runtime.schedule
         0     0% 21.65%      3.55s 19.51%  gitlab.com/gitlab-org/gitaly/proto/go/gitalypb._RefService_FindBranch_Handler.func1
     0.02s  0.11% 21.76%      3.54s 19.45%  gitlab.com/gitlab-org/gitaly/internal/service/ref.(*server).FindBranch
     0.26s  1.43% 23.19%      3.42s 18.79%  runtime.findrunnable
     0.01s 0.055% 23.24%      3.41s 18.74%  gitlab.com/gitlab-org/gitaly/internal/command.New
         0     0% 23.24%      3.41s 18.74%  gitlab.com/gitlab-org/gitaly/internal/git.BareCommand
     0.01s 0.055% 23.30%      3.17s 17.42%  runtime.futexsleep
         0     0% 23.30%      2.97s 16.32%  runtime.park_m
         0     0% 23.30%      2.34s 12.86%  gitlab.com/gitlab-org/gitaly/internal/git.Command
         0     0% 23.30%      2.19s 12.03%  gitlab.com/gitlab-org/gitaly/proto/go/gitalypb._CommitService_CommitIsAncestor_Handler

Output of checks

Results of GitLab environment info

Expand for output related to GitLab environment info
System information
System:		Ubuntu 18.04
Proxy:		no
Current User:	git
Using RVM:	no
Ruby Version:	2.6.3p62
Gem Version:	2.7.9
Bundler Version:1.17.3
Rake Version:	12.3.2
Redis Version:	3.2.12
Git Version:	2.22.0
Sidekiq Version:5.2.7
Go Version:	unknown

GitLab information
Version:	12.3.5-ee
Revision:	9dbaa740018
Directory:	/opt/gitlab/embedded/service/gitlab-rails
DB Adapter:	PostgreSQL
DB Version:	10.9
URL:		http://50k.testbed.gitlab.net
HTTP Clone URL:	http://50k.testbed.gitlab.net/some-group/some-project.git
SSH Clone URL:	git@50k.testbed.gitlab.net:some-group/some-project.git
Elasticsearch:	no
Geo:		no
Using LDAP:	no
Using Omniauth:	yes
Omniauth Providers:

GitLab Shell
Version:	10.0.0
Repository storage paths:
- default: 	/var/opt/gitlab/git-data/repositories
- storage1: 	/var/opt/gitlab/git-data/repositories
- storage2: 	/var/opt/gitlab/git-data/repositories
- storage3: 	/var/opt/gitlab/git-data/repositories
- storage4: 	/var/opt/gitlab/git-data/repositories
GitLab Shell path:		/opt/gitlab/embedded/service/gitlab-shell
Git:		/opt/gitlab/embedded/bin/git

Results of GitLab application Check

Expand for output related to the GitLab application check
Checking GitLab subtasks ...

Checking GitLab Shell ...

GitLab Shell: ... GitLab Shell version >= 10.0.0 ? ... OK (10.0.0)
Running /opt/gitlab/embedded/service/gitlab-shell/bin/check
Check GitLab API access: OK
Redis available via internal API: OK

gitlab-shell self-check successful

Checking GitLab Shell ... Finished

Checking Gitaly ...

Gitaly: ... default ... OK
storage1 ... OK
storage2 ... OK
storage3 ... OK
storage4 ... OK

Checking Gitaly ... Finished

Checking Sidekiq ...

Sidekiq: ... Running? ... no
  Try fixing it:
  sudo -u git -H RAILS_ENV=production bin/background_jobs start
  For more information see:
  doc/install/installation.md in section "Install Init Script"
  see log/sidekiq.log for possible errors
  Please fix the error above and rerun the checks.

Checking Sidekiq ... Finished

Checking Incoming Email ...

Incoming Email: ... Reply by email is disabled in config/gitlab.yml

Checking Incoming Email ... Finished

Checking LDAP ...

LDAP: ... LDAP is disabled in config/gitlab.yml

Checking LDAP ... Finished

Checking GitLab App ...

Git configured correctly? ... yes
Database config exists? ... yes
All migrations up? ... yes
Database contains orphaned GroupMembers? ... no
GitLab config exists? ... yes
GitLab config up to date? ... yes
Log directory writable? ... yes
Tmp directory writable? ... yes
Uploads directory exists? ... yes
Uploads directory has correct permissions? ... yes
Uploads directory tmp has correct permissions? ... yes
Init script exists? ... skipped (omnibus-gitlab has no init script)
Init script up-to-date? ... skipped (omnibus-gitlab has no init script)
Projects have namespace: ...
2/1 ... yes
2/19 ... yes
2/20 ... yes
2/31 ... yes
2/33 ... yes
21/35 ... yes
2/43 ... yes
2/44 ... yes
1/45 ... yes
1/46 ... yes
Redis version >= 2.8.0? ... yes
Ruby version >= 2.5.3 ? ... yes (2.6.3)
Git version >= 2.22.0 ? ... yes (2.22.0)
Git user has default SSH configuration? ... yes
Active users: ... 1
Is authorized keys file accessible? ... yes
Elasticsearch version 5.6 - 6.x? ... skipped (elasticsearch is disabled)

Checking GitLab App ... Finished


Checking GitLab subtasks ... Finished

Note that the Sidekiq error above is due to this being a HA environment where they run on separate nodes.

Edited by Grant Young
To upload designs, you'll need to enable LFS and have an admin enable hashed storage. More information