History timeout for files in large repo

Summary

We have a large repo, about 1.5 times the size of the Linux kernel.

When our users browse the history of a file in this repo, the page keeps loading and eventually times out with a 500 or 502 error.

Steps to reproduce

Navigate to the history of a file, for example a/b/c/d/e.xml

Example Project

N/A. We cannot share our monorepo externally

What is the current bug behavior?

A 500 or 502 error is displayed in the UI

What is the expected correct behavior?

The user should be able to see the file's history

Relevant logs and/or screenshots

image

Output of checks

Results of GitLab environment info


(For installations with omnibus-gitlab package run and paste the output of:
`sudo gitlab-rake gitlab:env:info`)

(For installations from source run and paste the output of:
`sudo -u git -H bundle exec rake gitlab:env:info RAILS_ENV=production`)


System information
System:         CentOS 7.9.2009
Proxy:          http_proxy: http://proxy
                https_proxy: http://proxy
                no_proxy: 127.0.0.1,localhost,az1,az2,support.org.com,org.com
Current User:   git
Using RVM:      no
Ruby Version:   2.7.2p137
Gem Version:    3.1.4
Bundler Version:2.1.4
Rake Version:   13.0.6
Redis Version:  6.0.14
Git Version:    2.32.0
Sidekiq Version:5.2.9
Go Version:     unknown

GitLab information
Version:        14.2.5-ee
Revision:       72c1da0383a
Directory:      /opt/gitlab/embedded/service/gitlab-rails
DB Adapter:     PostgreSQL
DB Version:     12.8
URL:            https://gitlab.org.com
HTTP Clone URL: https://gitlab.org.com/some-group/some-project.git
SSH Clone URL:  git@gitlab.org.com:some-group/some-project.git
Elasticsearch:  no
Geo:            no
Using LDAP:     yes
Using Omniauth: yes
Omniauth Providers: saml

GitLab Shell
Version:        13.19.1
Repository storage paths:
- default:      /var/opt/gitlab/git-data/repositories
- gitaly-pool-001:      /this/is/ignored/but/required/by/chefrecipe/repositories
GitLab Shell path:              /opt/gitlab/embedded/service/gitlab-shell
Git:            /opt/gitlab/embedded/bin/git


Results of GitLab application Check


(For installations with omnibus-gitlab package run and paste the output of:
`sudo gitlab-rake gitlab:check SANITIZE=true`)

(For installations from source run and paste the output of:
`sudo -u git -H bundle exec rake gitlab:check RAILS_ENV=production SANITIZE=true`)

(we will only investigate if the tests are passing)

Not relevant

Possible fixes

This is caused by GitLab Rails calling the FindCommits gRPC on Gitaly to get all commits related to a file path.

Ultimately, in Gitaly, this is translated to the following git call:

git log --follow -- a/b/c/d/e.xml

This not only finds all the commits that touched this file, it also tries to track the file's history through all possible renames, which is slow. Here is a benchmark I ran against our monorepo on my local laptop (with the Git commit-graph and Bloom filters prepared).

trunk ~/monorepo> hyperfine 'git log --follow -- a/b/c/d/e.xml'
Benchmark 1: git log --follow -- a/b/c/d/e.xml
  Time (mean ± σ):     92.327 s ±  9.180 s    [User: 84.668 s, System: 3.867 s]
  Range (min … max):   84.270 s … 110.414 s    10 runs

trunk ~/monorepo> hyperfine 'git log -- a/b/c/d/e.xml'
Benchmark 1: git log -- a/b/c/d/e.xml
  Time (mean ± σ):      1.449 s ±  0.031 s    [User: 1.100 s, System: 0.324 s]
  Range (min … max):    1.418 s …  1.522 s    10 runs

So while the Gitaly gRPC takes about 2 minutes to respond, the Rails layer times out the FindCommits call after 1 minute and returns a 502 to our user.

The code path in Rails starts here https://sourcegraph.com/gitlab.com/gitlab-org/gitlab@v14.2.5-ee/-/blob/app/controllers/projects/commits_controller.rb?L77 and ends by calling Project.repository.log(options), which triggers the Gitaly gRPC and then times out.

Here is the Gitaly gRPC handler that adds --follow to the git log call: https://sourcegraph.com/gitlab.com/gitlab-org/gitaly@v14.2.5/-/blob/internal/gitaly/service/commit/find_commits.go?L200

There are 2 possible fixes here:

  1. Make file/dir history in the GitLab UI NOT use --follow by default, and provide a checkbox for users to enable --follow manually. When a user enables --follow, use a gRPC with a more generous timeout, possibly streaming the results instead of using request/response.

  2. Add metadata to the Git commit-graph to help track renames and speed up rename detection when running git log --follow -- some/file.ext
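
To make fix 1 concrete, here is a minimal Python sketch of the "--follow is opt-in" idea. It is not how Gitaly or Rails are implemented; the function names, the `--format=%H` output choice, and the 5x timeout multiplier are all assumptions made for illustration. Only the `git log` options themselves (`--follow`, `--max-count`, `--format`) are real.

```python
import subprocess
from typing import List, Optional


def build_log_args(path: str, follow: bool = False,
                   limit: Optional[int] = None) -> List[str]:
    """Build a `git log` command line for the history of one path.

    `follow` defaults to False, mirroring the proposed unchecked checkbox.
    """
    args = ["git", "log", "--format=%H"]
    if follow:
        # Rename tracking is the expensive part, so it is opt-in here.
        # Note that git only supports --follow for a single file.
        args.append("--follow")
    if limit is not None:
        args.append(f"--max-count={limit}")
    # "--" separates options from the path, as in the commands above.
    args += ["--", path]
    return args


def file_history(repo_dir: str, path: str, follow: bool = False,
                 timeout_s: float = 60.0) -> List[str]:
    """Run the command and return commit IDs, newest first.

    Raises subprocess.TimeoutExpired if git does not finish in time.
    """
    # Hypothetical policy: a more generous budget when --follow was requested.
    budget = timeout_s * 5 if follow else timeout_s
    result = subprocess.run(
        build_log_args(path, follow=follow),
        cwd=repo_dir, capture_output=True, text=True,
        check=True, timeout=budget,
    )
    return result.stdout.split()
```

With this shape, the default page load would run the fast command (about 1.4 s in the benchmark above), and only an explicit user action would pay for rename tracking.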

cc: @jacobvosmaer-gitlab as you have touched this code path previously.

Additional details of possible implementations

1. Batching and time limited lookups

To make this more resilient, the frontend can issue multiple requests and the backend can fetch the history in batches.

The first request would fetch history back to a certain point in time. The response would carry the oldest time that was looked up.

The frontend would then pass the time from the first response as a parameter, so the backend can continue fetching from that point backwards.

It is not yet clear how the backend would know when to stop fetching, or how long each period should be.

See thread: #344549 (comment 757346304)
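
The batching idea can be sketched in Python as a cursor over time windows. This is only an illustration under stated assumptions: `Commit`, `fetch_batch`, `load_all`, the fixed `window_s`, and the `stop_at` lower bound are all hypothetical names and choices, and the in-memory `history` list stands in for the real backend query (which might, for instance, use the `--since` and `--until` options of `git log`). The stop condition used here, a known lower bound such as the time of the oldest commit, is just one possible answer to the open question above.

```python
from typing import List, NamedTuple, Tuple


class Commit(NamedTuple):
    sha: str
    committed_at: int  # Unix timestamp in seconds


def fetch_batch(history: List[Commit], before: int,
                window_s: int) -> Tuple[List[Commit], int]:
    """Backend side: return commits in [before - window_s, before), newest
    first, together with the cursor for the next (older) request."""
    lower = before - window_s
    batch = [c for c in history if lower <= c.committed_at < before]
    batch.sort(key=lambda c: c.committed_at, reverse=True)
    return batch, lower


def load_all(history: List[Commit], start: int, window_s: int,
             stop_at: int) -> List[Commit]:
    """Frontend side: keep requesting older windows, passing back the cursor
    from the previous response, until the cursor reaches stop_at."""
    cursor = start
    seen: List[Commit] = []
    while cursor > stop_at:
        batch, cursor = fetch_batch(history, cursor, window_s)
        seen.extend(batch)
    return seen
```

Each request is bounded by the window it covers, so a single slow stretch of history delays only one batch rather than timing out the whole page.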

Edited by André Luís