Gitaly spikes to 100% CPU usage when a moved project's branch details are called via API
Summary
This will affect all environments when customers move workload in different shards. We did find this out while testing out our new 50k Reference Architecture I noticed that one of the 4 Gitaly nodes was spiking to 100% when performance testing was done against the Repository Branch Details API.
http://50k.testbed.gitlab.net/-/grafana/dashboard/snapshot/KDpnPrAq5Cj3sH75cGenTJE400E3G0X8
In this setup we have 4 identical Gitaly nodes (64 CPU, 240GB Memory) that each have a copy of the same large project on each. Our test tool then tests the environment at a rate of 1000 RPS and makes calls equally to each of the 4 project copies (so approximately 250 RPS to each Gitaly node). The other copies of the prohect only should around 20% usage.
The only difference that applied here is how the copy of the project was imported to the affected Gitaly node - It was specifically moved from another Gitaly node via the Project Edit API where as all the others were imported to their Gitaly nodes directly via UI settings.
To verify this I imported another copy of the project today in the same way to another Gitaly node and was able to reproduce the issue successfully.
Of note the affected node shows numerous Git operations (details in the logs section below). Interestingly not all Linux task managers showed these Git operations but top did and they were averaging around 20% CPU usage.
Steps to reproduce
- Import a Project to an environment that has at least 2 Gitaly nodes (API was used in this case and the specific project is a copy of
gitlab-foss). - Move the project to the other Gitaly node via the Project Edit API.
- Set up a way to monitor the CPU use on the Gitaly node that now contains the project (e.g. Prometheus or even just watching
topon the node directly) -
Prepare the GitLab Performance Tool and the proceed to run the
api_v4_projects_repository_branches_branchtest against the environment. The RPS rate will be dependent on your environment but20s_20rpsshould do, e.g.ACCESS_TOKEN=<token> ./run-k6 -e environments/<environment>.json -s scenarios/20s_20rps.json -t tests/api_v4_projects_repository_branches_branch.js
- As the test is running note that the CPU usage spikes massively to 100%.
Example Project
We use a copy of gitlab-foss. It's specifically a backup of our own backup on GitHub, gitlabhq, as that's sanitised and doesn't contain private data.
What is the current bug behavior?
Gitaly CPU usage spikes massively to 100% when the Repository Branch Details API endpoint is polled for a project that's been moved to the Gitaly node via the Project Edit API.
What is the expected correct behavior?
That the Gitaly node shows the same CPU usage as others where the project is the same but hasn't been moved via the Project Edit API.
Relevant logs and/or screenshots
Perf Output, taken over a couple of minutes as a long version of the test was running:
Samples: 59M of event 'cpu-clock', Event count (approx.): 14942880250000
Overhead Command Shared Object Symbol
27.56% git libz.so.1.2.11 [.] inflate
21.79% git libz.so.1.2.11 [.] inflate_table
9.28% git libz.so.1.2.11 [.] inflate_fast
3.59% git libc-2.27.so [.] 0x000000000018adb8
3.53% git git [.] lookup_object
2.27% git libz.so.1.2.11 [.] adler32_z
1.69% git git [.] clear_commit_marks_1
1.61% git git [.] create_object
1.49% git git [.] hashmap_get
1.40% git git [.] parse_commit_buffer
1.16% git git [.] nth_packed_object_offset
1.13% git [kernel.kallsyms] [k] __do_page_fault
1.09% git [kernel.kallsyms] [k] clear_page_erms
0.91% git git [.] get_sha1_hex
0.61% git git [.] bsearch_hash
0.50% git git [.] compare_commits_by_gen_then_commit_date
0.46% git libc-2.27.so [.] 0x00000000000950dd
0.43% git [kernel.kallsyms] [k] _raw_spin_unlock_irqrestore
0.42% git git [.] unpack_object_header_buffer
0.40% git [kernel.kallsyms] [k] unmap_page_range
0.39% git [kernel.kallsyms] [k] get_page_from_freelist
0.37% git libc-2.27.so [.] malloc
0.37% git [kernel.kallsyms] [k] filemap_map_pages
0.36% git git [.] prio_queue_get
0.36% git [kernel.kallsyms] [k] try_charge
0.35% git git [.] paint_down_to_common
0.35% git [kernel.kallsyms] [k] __d_lookup_rcu
0.34% git git [.] prio_queue_put
0.29% git [kernel.kallsyms] [k] do_syscall_64
Go Tool pprof output. Top 30 Cumulative:
Showing nodes accounting for 4.24s, 23.30% of 18.20s total
Dropped 581 nodes (cum <= 0.09s)
Showing top 30 nodes out of 272
flat flat% sum% cum cum%
0 0% 0% 7.48s 41.10% google.golang.org/grpc.(*Server).serveStreams.func1.1
0 0% 0% 7.47s 41.04% google.golang.org/grpc.(*Server).handleStream
0.01s 0.055% 0.055% 7.47s 41.04% google.golang.org/grpc.(*Server).processUnaryRPC
0 0% 0.055% 6.93s 38.08% github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1
0 0% 0.055% 6.92s 38.02% github.com/grpc-ecosystem/go-grpc-middleware/tags.UnaryServerInterceptor.func1
0.01s 0.055% 0.11% 6.84s 37.58% github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1
0 0% 0.11% 6.84s 37.58% gitlab.com/gitlab-org/labkit/correlation/grpc.UnaryServerCorrelationInterceptor.func1
0 0% 0.11% 6.82s 37.47% gitlab.com/gitlab-org/gitaly/internal/middleware/metadatahandler.UnaryInterceptor
0 0% 0.11% 6.75s 37.09% github.com/grpc-ecosystem/go-grpc-prometheus.(*ServerMetrics).UnaryServerInterceptor.func1
0.01s 0.055% 0.16% 6.64s 36.48% github.com/grpc-ecosystem/go-grpc-middleware/logging/logrus.UnaryServerInterceptor.func1
0 0% 0.16% 5.75s 31.59% github.com/grpc-ecosystem/go-grpc-middleware/auth.UnaryServerInterceptor.func1
0 0% 0.16% 5.75s 31.59% gitlab.com/gitlab-org/gitaly/internal/middleware/cancelhandler.Unary
0 0% 0.16% 5.75s 31.59% gitlab.com/gitlab-org/gitaly/internal/middleware/limithandler.(*LimiterMiddleware).UnaryInterceptor.func1
0 0% 0.16% 5.75s 31.59% gitlab.com/gitlab-org/gitaly/internal/middleware/sentryhandler.UnaryLogHandler
0 0% 0.16% 5.64s 30.99% github.com/grpc-ecosystem/go-grpc-middleware/tracing/opentracing.UnaryServerInterceptor.func1
0.02s 0.11% 0.27% 5.61s 30.82% gitlab.com/gitlab-org/gitaly/internal/middleware/cache.UnaryInvalidator.func1
0.01s 0.055% 0.33% 5.58s 30.66% gitlab.com/gitlab-org/gitaly/internal/middleware/panichandler.UnaryPanicHandler
0 0% 0.33% 4.06s 22.31% gitlab.com/gitlab-org/gitaly/proto/go/gitalypb._RefService_FindBranch_Handler
0 0% 0.33% 3.87s 21.26% runtime.mcall
3.84s 21.10% 21.43% 3.84s 21.10% runtime.futex
0.04s 0.22% 21.65% 3.76s 20.66% runtime.schedule
0 0% 21.65% 3.55s 19.51% gitlab.com/gitlab-org/gitaly/proto/go/gitalypb._RefService_FindBranch_Handler.func1
0.02s 0.11% 21.76% 3.54s 19.45% gitlab.com/gitlab-org/gitaly/internal/service/ref.(*server).FindBranch
0.26s 1.43% 23.19% 3.42s 18.79% runtime.findrunnable
0.01s 0.055% 23.24% 3.41s 18.74% gitlab.com/gitlab-org/gitaly/internal/command.New
0 0% 23.24% 3.41s 18.74% gitlab.com/gitlab-org/gitaly/internal/git.BareCommand
0.01s 0.055% 23.30% 3.17s 17.42% runtime.futexsleep
0 0% 23.30% 2.97s 16.32% runtime.park_m
0 0% 23.30% 2.34s 12.86% gitlab.com/gitlab-org/gitaly/internal/git.Command
0 0% 23.30% 2.19s 12.03% gitlab.com/gitlab-org/gitaly/proto/go/gitalypb._CommitService_CommitIsAncestor_Handler
Output of checks
Results of GitLab environment info
Expand for output related to GitLab environment info
System information System: Ubuntu 18.04 Proxy: no Current User: git Using RVM: no Ruby Version: 2.6.3p62 Gem Version: 2.7.9 Bundler Version:1.17.3 Rake Version: 12.3.2 Redis Version: 3.2.12 Git Version: 2.22.0 Sidekiq Version:5.2.7 Go Version: unknown GitLab information Version: 12.3.5-ee Revision: 9dbaa740018 Directory: /opt/gitlab/embedded/service/gitlab-rails DB Adapter: PostgreSQL DB Version: 10.9 URL: http://50k.testbed.gitlab.net HTTP Clone URL: http://50k.testbed.gitlab.net/some-group/some-project.git SSH Clone URL: git@50k.testbed.gitlab.net:some-group/some-project.git Elasticsearch: no Geo: no Using LDAP: no Using Omniauth: yes Omniauth Providers: GitLab Shell Version: 10.0.0 Repository storage paths: - default: /var/opt/gitlab/git-data/repositories - storage1: /var/opt/gitlab/git-data/repositories - storage2: /var/opt/gitlab/git-data/repositories - storage3: /var/opt/gitlab/git-data/repositories - storage4: /var/opt/gitlab/git-data/repositories GitLab Shell path: /opt/gitlab/embedded/service/gitlab-shell Git: /opt/gitlab/embedded/bin/git
Results of GitLab application Check
Expand for output related to the GitLab application check
Checking GitLab subtasks ... Checking GitLab Shell ... GitLab Shell: ... GitLab Shell version >= 10.0.0 ? ... OK (10.0.0) Running /opt/gitlab/embedded/service/gitlab-shell/bin/check Check GitLab API access: OK Redis available via internal API: OK gitlab-shell self-check successful Checking GitLab Shell ... Finished Checking Gitaly ... Gitaly: ... default ... OK storage1 ... OK storage2 ... OK storage3 ... OK storage4 ... OK Checking Gitaly ... Finished Checking Sidekiq ... Sidekiq: ... Running? ... no Try fixing it: sudo -u git -H RAILS_ENV=production bin/background_jobs start For more information see: doc/install/installation.md in section "Install Init Script" see log/sidekiq.log for possible errors Please fix the error above and rerun the checks. Checking Sidekiq ... Finished Checking Incoming Email ... Incoming Email: ... Reply by email is disabled in config/gitlab.yml Checking Incoming Email ... Finished Checking LDAP ... LDAP: ... LDAP is disabled in config/gitlab.yml Checking LDAP ... Finished Checking GitLab App ... Git configured correctly? ... yes Database config exists? ... yes All migrations up? ... yes Database contains orphaned GroupMembers? ... no GitLab config exists? ... yes GitLab config up to date? ... yes Log directory writable? ... yes Tmp directory writable? ... yes Uploads directory exists? ... yes Uploads directory has correct permissions? ... yes Uploads directory tmp has correct permissions? ... yes Init script exists? ... skipped (omnibus-gitlab has no init script) Init script up-to-date? ... skipped (omnibus-gitlab has no init script) Projects have namespace: ... 2/1 ... yes 2/19 ... yes 2/20 ... yes 2/31 ... yes 2/33 ... yes 21/35 ... yes 2/43 ... yes 2/44 ... yes 1/45 ... yes 1/46 ... yes Redis version >= 2.8.0? ... yes Ruby version >= 2.5.3 ? ... yes (2.6.3) Git version >= 2.22.0 ? ... yes (2.22.0) Git user has default SSH configuration? ... yes Active users: ... 1 Is authorized keys file accessible? ... yes Elasticsearch version 5.6 - 6.x? ... skipped (elasticsearch is disabled) Checking GitLab App ... Finished Checking GitLab subtasks ... FinishedNote that the Sidekiq error above is due to this being a HA environment where they run on separate nodes.
