Proposal: many Gitaly RPC's should be replaced by thin wrappers around Git commands

Status 2021-11-09

This proposal was discussed in a meeting on 2021-11-09 by @mjwood @pks-t @timzallmann and @jacobvosmaer-gitlab. The outcome of the meeting was that this proposal is not practical from an organizational viewpoint because of a preference to keep Git expertise inside the Gitaly team.

The original proposal follows below for reference. The original proposal was authored by @jacobvosmaer-gitlab .

Proposal

Many existing Gitaly RPC's follow a design pattern that is so inefficient that Gitaly itself can cause production incidents. This design pattern forces both Gitaly the application and the Gitaly the team to do unnecessary work. In this document we propose a different approach, that will allow us to have fewer RPC's, that are both easier to maintain and more efficient than the ones they replace.

Gitaly is the Git RPC service that GitLab uses for Git repository storage. When we created Gitaly in 2017-2018 we ended up with a design where Gitaly RPC's are restricted wrappers around Git commands that return an abstract response. For example, we have 13 different RPC's that copy input from the client into a git cat-file process, interpret the process output, and return an abstract response. In this proposal we will argue that we should stop using this design because it is bad for performance and developer productivity.

Instead of having 13 RPC's that feed 13 kinds of input into a git cat-file process and send 13 different kinds of responses, we should have one RPC that runs git cat-file and lets the client interpret the response the way they want to. This one RPC should return the data generated by git cat-file as-is.

Motivation

This proposal is motivated by infrastructure incidents. For as long as we have Gitaly, we have had recurring incidents where traffic on a single Git repository causes a brown-out for an entire Gitaly server. It does not have to be like this.

Through recent other work to make Gitaly more efficient, we have learned that often inefficiencies in Gitaly itself amplify the impact of "heavy" user traffic. Instead of blaming the users, we should address these inefficiencies. The RPC architecture that we propose in this document is designed to get the Git data GitLab needs from Gitaly off of the Gitaly server with the least amount of work.

At the same time, this architecture would improve developer productivity because it puts less work on the shoulders of a small team (Gitaly) using a niche programming language (Go).

Current situation

Gitaly is written in a different language and repository (Go, in gitlab-org/gitaly) from the rest of GitLab's backend (Ruby, in gitlab-org/gitlab). Both of these factors (different languages, different repositories) create developer friction. Go is a minority language at GitLab and there are fewer Gitaly maintainers and developers than gitlab-org/gitlab maintainers and developers. This means that Gitaly is sometimes a development bottleneck.

Gitaly is sometimes also a performance bottleneck. For example, we recently went through a series of incidents on file-43 that were being triggered by very small bursts of FindAllTags RPC requests on a single repository. These requests were being handled so inefficiently by Gitaly that they would cause the whole server (with 32 CPU cores) to slow down for several minutes. The prototype discussed in gitlab-com/gl-infra&600 (closed) showed that the current implementation used 9x more CPU than one based on the architecture we are proposing in this document. That shows the current way of doing things is inefficient. It is also unfair, because traffic on just one repository affected all other repositories on that server.

Ironically, the incidents on file-43 stopped after we did more development work and added FindRefsByOID, the 9th RPC that is a wrapper around git for-each-ref. This required both Gitaly and gitlab-org/gitlab development work. If we had had a generic RPC for git for-each-ref, then the performance optimization that stopped the incidents could have been implemented on the gitlab-org/gitlab side without needing Gitaly developer involvement.

Duplicated RPC's

Git command	RPC's that wrap it
`git cat-file`	13
`git rev-list \| git cat-file`	9
`git for-each-ref`	9
`git diff`	7
`git rev-list`	6
`git for-each-ref \| git cat-file`	4
`git ls-tree`	3

source

The table above shows some of the duplication that is currently going on. For example, we have 9 RPC's that run git for-each-ref:

CalculateChecksum: checksum of git for-each-ref output (probably worth keeping as a dedicated RPC for performance reasons: less data to transfer)
FindAllBranchNames: git for-each-ref --format=%(refname) refs/heads
FindAllTagNames: git for-each-ref --format=%(refname) refs/tags
FindRefsByOID: git for-each-ref --contains $OID
HasLocalBranches: git for-each-ref --count=1 refs/heads
ListBranchNamesContainingCommit: git for-each-ref --format=%(refname) --contains $OID refs/heads
ListRefs: git for-each-ref
ListTagNamesContainingCommit: git for-each-ref --format=%(refname) --contains $OID refs/tags
RefExists: git for-each-ref --count=1 $REF

Almost all of these things could be implemented client-side if we had an RPC that is a thin wrapper around git for-each-ref that allows the caller to pass in allow-listed arguments.

The RPC could be called ForEachRef and it should take two kinds of arguments: a list of flags (things like --count=1) and a list of arguments (things like refs/heads). Because the output depends on the --format flag, Gitaly should not bother to try and parse it. This is also the most efficient way to get the response data off of the Gitaly server.

Proposed solution

We should have Gitaly RPC's that are simpler (just run a Git command) and more flexible, and deprecate a lot of existing RPC's over time.

The new RPC's should:

Take arbitrary command-line arguments that get vetted at runtime by Gitaly against allow-lists. If the caller passes an argument that is not allowed they get an InvalidArgument error. Vetting is necessary for security reasons. Some rework will be necessary to manage the allow-lists over time, but overall rework will be much less than constantly adding new RPC's.
Take an opaque byte stream as standard input: this gives callers maximum flexibility on what they can use the RPC for, and avoids rework where we add new capabilities.
Return standard output as an opaque byte stream. This avoids rework and it is the most efficient way to get data off of the Gitaly server back to the caller.
Be implemented as pipelines of non-gitaly processes, rather than pipelines that weave in and out of gitaly like they do now. This is to prevent noisy neighbor problems where the main gitaly process is too busy handling a single RPC.

To upload designs, you'll need to enable LFS and have an admin enable hashed storage. More information