History timeout for files in large repo
Summary
We have a large repo, about 1.5 times the size of the Linux kernel.
When our users browse the history of a file in this repo, the browser keeps loading and eventually times out with a 500 or 502 error.
Steps to reproduce
Navigate to the history of a file, e.g. a/b/c/d/e.xml
Example Project
N/A. We cannot share our monorepo externally
What is the current bug behavior?
A 500 or 502 error is displayed in the UI
What is the expected correct behavior?
The user should be able to see the file history
Relevant logs and/or screenshots
Output of checks
Results of GitLab environment info
Expand for output related to GitLab environment info
(For installations with omnibus-gitlab package run and paste the output of: `sudo gitlab-rake gitlab:env:info`)
(For installations from source run and paste the output of: `sudo -u git -H bundle exec rake gitlab:env:info RAILS_ENV=production`)

System information
System: CentOS 7.9.2009
Proxy: http_proxy: http://proxy
       https_proxy: http://proxy
       no_proxy: 127.0.0.1,localhost,az1,az2,support.org.com,org.com
Current User: git
Using RVM: no
Ruby Version: 2.7.2p137
Gem Version: 3.1.4
Bundler Version: 2.1.4
Rake Version: 13.0.6
Redis Version: 6.0.14
Git Version: 2.32.0
Sidekiq Version: 5.2.9
Go Version: unknown

GitLab information
Version: 14.2.5-ee
Revision: 72c1da0383a
Directory: /opt/gitlab/embedded/service/gitlab-rails
DB Adapter: PostgreSQL
DB Version: 12.8
URL: https://gitlab.org.com
HTTP Clone URL: https://gitlab.org.com/some-group/some-project.git
SSH Clone URL: git@gitlab.org.com:some-group/some-project.git
Elasticsearch: no
Geo: no
Using LDAP: yes
Using Omniauth: yes
Omniauth Providers: saml

GitLab Shell
Version: 13.19.1
Repository storage paths:
- default: /var/opt/gitlab/git-data/repositories
- gitaly-pool-001: /this/is/ignored/but/required/by/chefrecipe/repositories
GitLab Shell path: /opt/gitlab/embedded/service/gitlab-shell
Git: /opt/gitlab/embedded/bin/git
Results of GitLab application Check
Expand for output related to the GitLab application check
(For installations with omnibus-gitlab package run and paste the output of: `sudo gitlab-rake gitlab:check SANITIZE=true`)
(For installations from source run and paste the output of: `sudo -u git -H bundle exec rake gitlab:check RAILS_ENV=production SANITIZE=true`)
(we will only investigate if the tests are passing)
Not relevant
Possible fixes
This is caused by GitLab Rails calling the FindCommits gRPC on Gitaly to get all commits related to a file path.
In Gitaly, this is ultimately translated into the following git call:
git log --follow -- a/b/c/d/e.xml
This not only finds all the commits that touched the file, it also tries to track the file's history through all possible renames, which is slow. Here is a benchmark I ran against our monorepo on my local laptop (with the Git commit-graph and Bloom filters prepared).
trunk ~/monorepo> hyperfine 'git log --follow -- a/b/c/d/e.xml'
Benchmark 1: git log --follow -- a/b/c/d/e.xml
Time (mean ± σ): 92.327 s ± 9.180 s [User: 84.668 s, System: 3.867 s]
Range (min … max): 84.270 s … 110.414 s 10 runs
trunk ~/monorepo> hyperfine 'git log -- a/b/c/d/e.xml'
Benchmark 1: git log -- a/b/c/d/e.xml
Time (mean ± σ): 1.449 s ± 0.031 s [User: 1.100 s, System: 0.324 s]
Range (min … max): 1.418 s … 1.522 s 10 runs
So while the Gitaly gRPC call takes about 2 minutes to respond, the Rails layer times out the FindCommits gRPC after 1 minute and returns a 502 to the user.
The code path in Rails starts here https://sourcegraph.com/gitlab.com/gitlab-org/gitlab@v14.2.5-ee/-/blob/app/controllers/projects/commits_controller.rb?L77 and ends with a call to Project.repository.log(options), which triggers the Gitaly gRPC and then times out.
Here is the Gitaly gRPC handler that adds --follow to the git log call: https://sourcegraph.com/gitlab.com/gitlab-org/gitaly@v14.2.5/-/blob/internal/gitaly/service/commit/find_commits.go?L200
There are 2 possible fixes here:
- Make file/dir history in the GitLab UI NOT use --follow by default, and provide a checkbox for users to enable --follow manually. Then, when a user enables --follow, use a gRPC with a more generous timeout, possibly with streaming enabled instead of request/response.
- Create metadata in the Git commit-graph to help track and speed up rename detection when running git log --follow -- some/file.ext
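The first fix can be sketched roughly as follows. This is only an illustration of making --follow opt-in, not the actual Gitaly handler: `build_log_args` and its `follow` and `limit` parameters are hypothetical names, and the real code lives in Go in find_commits.go.

```python
from typing import List, Optional


def build_log_args(path: str, follow: bool = False, limit: Optional[int] = None) -> List[str]:
    """Build the argv for a file-history lookup (hypothetical helper).

    --follow is added only when the caller explicitly asks for it.
    """
    args = ["git", "log"]
    if follow:
        # Rename tracking is the slow part, so it is opt-in in this sketch.
        args.append("--follow")
    if limit is not None:
        args.append(f"--max-count={limit}")
    # "--" separates options from the path so the path is never read as a revision.
    args += ["--", path]
    return args


# Default (fast) lookup, then the opt-in lookup with rename tracking.
print(build_log_args("a/b/c/d/e.xml"))
print(build_log_args("a/b/c/d/e.xml", follow=True))
```

With this shape, the UI checkbox would only flip the `follow` argument, and the slow path could be routed to a call with a longer timeout.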
cc: @jacobvosmaer-gitlab as you have touched this code path previously.
Additional details of possible implementations
1. Batching and time-limited lookups
To make this more resilient, the frontend can issue multiple requests and the backend can fetch the history in batches.
The first request would fetch history back to a certain point in time, and its response would carry the oldest time that was looked up.
The frontend would then pass that time from the first response as a parameter, so the backend can continue fetching from that point backwards.
It is not clear how the backend would know when to stop fetching, or how long each period should be.
See thread: #344549 (comment 757346304)
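A minimal sketch of the batching idea, under stated assumptions: the history is simulated with an in-memory list instead of a Gitaly call, and all names (`fetch_batch`, `in_memory_window`, `fetch_all`, `floor`) are hypothetical. The stopping rule used here, a lower-bound timestamp `floor`, is just one assumed answer to the open question above, not a decided design.

```python
from typing import Callable, List, Optional, Tuple

Commit = Tuple[str, int]  # (sha, commit time in Unix seconds)
FetchWindow = Callable[[int, int], List[Commit]]


def fetch_batch(fetch_window: FetchWindow, before: int, window_seconds: int,
                floor: int) -> Tuple[List[Commit], Optional[int]]:
    """Fetch one time window ending at `before`; return commits and the next cursor.

    The cursor is None once the assumed lower bound `floor` has been reached.
    """
    since = max(before - window_seconds, floor)
    commits = fetch_window(since, before)
    next_before = since if since > floor else None
    return commits, next_before


def in_memory_window(history: List[Commit]) -> FetchWindow:
    """Stand-in for the backend: filter a list by the half-open window [since, before)."""
    def fetch_window(since: int, before: int) -> List[Commit]:
        return [c for c in history if since <= c[1] < before]
    return fetch_window


def fetch_all(fetch_window: FetchWindow, start: int, window_seconds: int,
              floor: int) -> List[Commit]:
    """Frontend loop: keep passing the returned cursor back until it is None."""
    results: List[Commit] = []
    cursor: Optional[int] = start
    while cursor is not None:
        commits, cursor = fetch_batch(fetch_window, cursor, window_seconds, floor)
        results.extend(commits)
    return results


history = [("c3", 250), ("c2", 150), ("c1", 50)]
print(fetch_all(in_memory_window(history), start=300, window_seconds=100, floor=0))
```

Each request does a bounded amount of work, so no single call has to outlive the Rails timeout; the cost is that the frontend needs several round trips for a long history.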