BranchHooksService (during PostReceive) fetches too many commits for what it needs to do
Our `PostReceive` worker runs after we receive changes to a project's git repository. When a branch has been pushed to, it also runs `BranchHooksService#execute`, as described in the complicated call stack at #30883 (closed).
In a couple of recent cases on GitLab.com's production site (gitlab-com/gl-infra/production#1536 (closed), gitlab-com/gl-infra/on-call-handovers#214 (closed)), we saw this queue back up. The cause of the backing up appears to have been that some of the jobs in the queue were taking too long, and also taking a lot of CPU time. (CPU time is a problem because of Ruby's global interpreter lock: executing Ruby code in one thread will prevent other threads from executing Ruby code.)
## Additional information
The same issue has also been opened as #196966, with more information that identifies a second problem.
## Proposed solution
Note that this may not resolve the problem identified in #196966 (see here for more information).

I think that we should reduce the CPU and Gitaly usage of `Git::BranchHooksService#commits` by creating two new Gitaly RPCs (or extending existing ones):
- Gets the last N (here, 100) commits between two revisions.
- Counts the total number of commits between two revisions.
Currently we are combining these two operations, which means we fetch an unbounded set of commits from Gitaly, count them, and throw away all but the last 100. In addition to being potentially slower on the Gitaly side, this is also very slow on the Rails side.
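A minimal sketch of the proposed split, using an in-memory array as a stand-in for the revision range. In reality these would be two Gitaly RPCs and the full range would never cross the wire; the operation shapes here are assumptions, not real Gitaly interfaces.

```ruby
PROCESS_COMMIT_LIMIT = 100

# Pretend this revision range lives on the Gitaly server.
all_commits_on_server = (1..54_000).map { |i| format('%040x', i) }

# Proposed operation 1: fetch only the last N commits between two revisions.
last_commits = all_commits_on_server.last(PROCESS_COMMIT_LIMIT)

# Proposed operation 2: count the commits between two revisions,
# without materialising any of them client-side.
commits_count = all_commits_on_server.size

puts last_commits.size # => 100
puts commits_count     # => 54000
```

With the split, the Rails side only ever holds `PROCESS_COMMIT_LIMIT` commit objects, regardless of how far the branches have diverged.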
## Detailed explanation
The main problem in both of these cases appears to have been pushes that take a very long time to process. In particular, they spent a long time in:

```
Git::BranchHooksService#enqueue_process_commit_messages
-> Git::BaseHooksService#limited_commits
-> Git::BranchHooksService#commits
-> Repository#commits_between
```
Our intent in this service is to get:
- At most 100 commits from the push. (100 is `Git::BaseHooksService::PROCESS_COMMIT_LIMIT`.)
- The total number of commits from the push.
However, `Git::BranchHooksService#commits` has varying behaviour:

1. If we're creating the default branch, we only fetch the last 100 commits from the push (and override `#commits_count` to make the count correct).
2. If we're creating any other branch, we fetch all the commits that diverge between this branch and the default branch.
3. If we're updating any branch, we do the same as 2, but compare the current state of this branch to the previous state.
4. If we're deleting a branch, we do nothing, because we didn't receive any commits.
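The four behaviours above can be sketched roughly as follows. `FakeRepository`, `commits_for`, and the symbols are illustrative stand-ins, not GitLab's real classes or method names:

```ruby
# Illustrative stand-in for a repository; NOT GitLab's real Repository class.
FakeRepository = Struct.new(:all_commits) do
  def last_commits(limit:)
    all_commits.last(limit)
  end

  # Simplification: pretend every commit diverges between the two revisions.
  def commits_between(_from, _to)
    all_commits
  end
end

# Rough shape of the varying behaviour in Git::BranchHooksService#commits.
def commits_for(change, repository)
  case change
  when :creating_default_branch
    repository.last_commits(limit: 100)                # bounded: at most 100
  when :creating_other_branch
    repository.commits_between(:default_tip, :new_tip) # unbounded (case 2)
  when :updating_branch
    repository.commits_between(:old_tip, :new_tip)     # unbounded (case 3)
  when :deleting_branch
    []                                                 # nothing was received
  end
end

repo = FakeRepository.new((1..54_000).to_a)
puts commits_for(:creating_default_branch, repo).size # => 100
puts commits_for(:creating_other_branch, repo).size   # => 54000
```

Only case 1 bounds the fetch; cases 2 and 3 grow with the divergence between the two revisions.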
Case 2 is the big problem here (and potentially case 3 as well, although it's less likely). When the pushed branch diverges significantly from the default branch, we can end up with a lot of commits. We will then:
- Fetch all those commits from Gitaly.
- Decorate them in Ruby (convert to a different object type) in `Repository#commits_between`.
- Take the count for `#commits_count`.
- In `#limited_commits`, throw away all but the last 100.
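The wasteful shape of the current pipeline can be sketched like this (simplified; `DecoratedCommit` is a hypothetical stand-in for the Rails-side commit object):

```ruby
# Hypothetical stand-in for the Rails-side decorated commit object.
DecoratedCommit = Struct.new(:sha)

# Step 1: fetch all the commits from Gitaly (here, 54,000 of them).
raw_shas = (1..54_000).map { |i| format('%040x', i) }

# Step 2: decorate every single one in Ruby, burning CPU under the GIL.
decorated = raw_shas.map { |sha| DecoratedCommit.new(sha) }

# Step 3: take the count.
commits_count = decorated.size

# Step 4: throw away all but the last 100.
limited = decorated.last(100)

puts commits_count # => 54000
puts limited.size  # => 100
```

Steps 1 and 2 do work proportional to the full divergence, even though step 4 discards almost all of it.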
This can be really slow and block other jobs from executing! The decoration itself takes a noticeable amount of CPU time. Here's a real case from production where there were 54,000 commits between two branches:
```ruby
mirror = Project.find(3121189)

Benchmark.bm do |x|
  x.report('decorated') { mirror.repository.commits_between('1e9e926c4d15483460bc04a7760078ad2fbce39d', '1b612917e370f4fa6cbcbc8438b0461ff79a4357') }
  x.report('raw') { Gitlab::Git::Commit.between(mirror.repository.raw_repository, '1e9e926c4d15483460bc04a7760078ad2fbce39d', '1b612917e370f4fa6cbcbc8438b0461ff79a4357') }
end
```
This prints:
```
                user     system      total        real
decorated   6.560000   0.260000   6.820000 (  8.235514)
raw         5.604000   0.144000   5.748000 (  6.447340)
```
That's a full extra second of CPU time just for this job, under non-resource-constrained conditions. And most of that work is unnecessary, as discussed above.
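For reference, the "extra second" figure is just the difference between the two total CPU times reported above:

```ruby
# Totals taken from the benchmark output above.
decorated_cpu = 6.820000
raw_cpu       = 5.748000

extra_cpu = decorated_cpu - raw_cpu
puts extra_cpu.round(3) # => 1.072
```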