Remove archive generation from the Go proxy
This is a follow up to a discussion: !27746 (comment 343643472).
The Go proxy MVC implementation may walk the entire repository tree. This can have a negative impact on performance. Until this is mitigated, the Go proxy is behind a feature flag. Once archive generation is moved into Workhorse or otherwise removed from Rails request processing, the feature flag can be removed.
Additionally, subdirectories containing a
go.modfile must be excluded.That does make this a little trickier to handle. If this were just a simple glob pattern, then I think it's pretty straightforward change. I left some details out. This is the flow:
sequenceDiagram Client->>+Workhorse: GET /api/v4/1234/api/packages/go/module/@v/v1.0.0.zip Workhorse->>+Rails: GET /api/v4/1234/api/packages/go/module/@v/v1.0.0.zip Rails->>+Workhorse: Gitlab-Workhorse-Send-Data Workhorse->>Gitaly: SendArchiveRequest Gitaly->>Git: git archive v1.0.0 Git->>Gitaly: <streamed data> Gitaly->>Workhorse: <streamed data> Workhorse->>Client: v1.0.0.zipRails sends back the
Gitlab-Workhorse-Send-Dataheader that Workhorse recognizes as a request to the Gitaly file servers to rungit archive.It's easy to add a prefix (e.g.
gitlab.com/my/project) before each file because the gRPCSendArchiveRequestalready has aPrefixparameter (https://gitlab.com/gitlab-org/gitaly/blob/f89b33baaa4b34db9444d92466921c1e4a0a66f5/internal/service/repository/archive.go#L96-119) that is passed togit archive --prefix <prefix>.
git archiveapparently works with globs just fine. For example, I was able to do this on the GitLab repo:git archive --prefix "gitlab.com/gitlab-org/" --format zip master "*/**/README.md" > /tmp/test.zipThis generates a
test.zipfile that only hasREADME.mdfiles.However, to fulfill the "subdirectories containing a
go.modfile must be excluded" requirement, we would have to make a second pass through thiszipfile, scan the tree, and delete the files before sending it back. That's probably doable, but we would probably have to teach Gitaly to do this special filtering before sending it back to Workhorse. That being said, it's probably pretty fast to scan the ZIP metadata and make that decision versus trying to do this with Git repository data.
The download API doesn't support the Golang prefix here. I think we'll want to extend the Workhorse code to be able to do this.
archiveParamsincludesArchivePrefix, so the workhorse already supports--prefix; all that needs to change on that front isGitlab::Workhorse.send_git_archive.We would have to make a second pass through this
zipfile, scan the tree, and delete the files before sending it back. That's probably doable, but we would probably have to teach Gitaly to do this special filtering before sending it back to Workhorse.Should Gitaly do this or should Workhorse? It feels a bit weird for Gitaly to have code specific to Go (and less weird for Workhorse to have that). Workhorse will need an additional method either way. Instead of teeing to a temp file and the HTTP stream, Workhorse could copy to the temp file, scan it, then write a filtered version to the HTTP stream.
As an alternative solution, what about using
RepositoryService.SearchFilesByName(Repository#search_files_by_name) to get a complete list of files in the repository? This could be used to construct a list of what files should be archived. This could be done by a custom Workhorse method, or if that Gitaly call is fast enough, perhaps in the Rails app. However,RepositoryService.GetArchiveonly permits a single path argument at the moment. So this would require either a newRepositoryServicemethod (GetArchiveWithPaths), or a modification to support specifying multiple paths forGetArchive, if that can be done in a backwards compatible way.
I think the best approach is:
- List all of the files:
RepositoryService.SearchFilesByName(query: <path>, ref: <ref>)- Identify any
<path>/sub/dir/go.modfiles and create a list of those subdirectories- Request an archive:
RepositoryService.GetArchive(commit_id: <ref>.sha, path: <path>, prefix: <prefix>)- Filter the archive stream on the fly, skipping any entry within one of the subdirectories from step 2
As far as I can tell, the only mechanism for excluding files from
git archiveis with a gitattributes file, which is problematic. The only way I can think of excluding files (without some /tmp/gitattributes hack) is to exhaustively list every file we do want, which could be a problem for large Go repositories. Additionally,RepositoryService.GetArchivedoes not currently support multiple paths. Thus I think it would be better to filter after the fact.This approach would also eliminate the need to scan the
git archive. I believe TAR and ZIP are both streaming formats, so the stream can be rewritten and entries omitted on the fly, without saving it to disk at any point.I see two ways to implement this:
- Create a new Workhorse method for this specific case; or
- Call
Repository#search_files_by_namefromgo_proxy.rband add anExcludePaths []stringparameter to Workhorse'sSendArchivemethodI prefer implementation option 2, because it doesn't involve adding a custom method to Workhorse. I think it may be feasible because
git ls-treein my local GitLab repo runs in a fraction of a second:time git ls-tree --full-tree --name-status -r master | wc -l #> 32786 #> git ls-tree --full-tree --name-status -r master 0.03s user 0.05s system 77% cpu 0.101 total #> wc -l 0.00s user 0.02s system 15% cpu 0.098 total