Encoding errors in ListLastCommitsForTree

https://sentry.gitlab.net/gitlab/gitlabcom/issues/644021/?query=is:unresolved%20gitaly

Gitlab::Git::CommandError: 13:grpc: error while marshaling: proto: field "gitaly.ListLastCommitsForTreeResponse.CommitForTree.Path" contains invalid UTF-8

This points to a mistake in the protocol design: https://gitlab.com/gitlab-org/gitaly-proto/blob/a77c232131bf303dabb1bf399b74f951d1cdb23f/commit.proto#L314

  message CommitForTree {
    reserved 1;

    GitCommit commit = 2;
    string path = 3;
  }

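This assumes that the path is always valid UTF-8, which is not true. A protobuf string field must contain valid UTF-8, but Git is much more permissive than that: a path in a Git tree can be almost any sequence of bytes.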
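To make the mismatch concrete, here is a minimal Go sketch that checks whether a raw path could go into a protobuf string field. The helper name isValidProtoString and the example file name are made up for illustration; only the standard library is used.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// isValidProtoString reports whether p is valid UTF-8, which is what a
// protobuf string field requires. The helper name is hypothetical.
func isValidProtoString(p []byte) bool {
	return utf8.Valid(p)
}

func main() {
	// A file name with e-acute encoded as Latin-1 (single byte 0xE9).
	// Git accepts this path, but it is not valid UTF-8.
	latin1Path := []byte("caf\xe9.txt")
	// The same name with e-acute encoded as UTF-8 (bytes 0xC3 0xA9).
	utf8Path := []byte("caf\xc3\xa9.txt")

	fmt.Println(isValidProtoString(latin1Path)) // false
	fmt.Println(isValidProtoString(utf8Path))   // true
}
```

A path like the first one is the kind of input that would trigger the marshaling error quoted above.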

The general approach to text encodings in Gitaly is to return text to the client as-is. The client (gitlab-ce) has its own text-mangling code to destructively convert text to UTF-8.

So as a solution I propose we update the protocol -- in a backwards compatible way! -- to return paths as type bytes.


Regarding backwards compatibility: what we certainly cannot do here is change the existing field to bytes path, because the name path is already taken. We need a new field, with a new name, to send the path as bytes. Then we need to update the clients to use that new field, wait a release, and deprecate the old field.

During the transition we need to send both the existing string path field and the new field. Otherwise old clients would stop receiving path data during deploys.
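As a rough sketch of what the server side could look like during the transition, the Go snippet below fills both fields from the raw path. Everything here is hypothetical: commitForTree is a hand-written stand-in for the generated message type, and the field name PathBytes is only a placeholder for whatever the new field ends up being called. The sketch also assumes that the old string field gets a lossy UTF-8 replacement when the raw path is not valid UTF-8, so that marshaling keeps working for old clients; that choice is an assumption of this example, not something decided in this issue.

```go
package main

import (
	"fmt"
	"strings"
)

// commitForTree is a hypothetical stand-in for the generated message type.
type commitForTree struct {
	Path      string // existing string field; must be valid UTF-8
	PathBytes []byte // hypothetical new field; carries the path as-is
}

// newCommitForTree fills both fields from the raw path reported by Git.
func newCommitForTree(rawPath []byte) commitForTree {
	return commitForTree{
		// Assumption: replace invalid byte runs with U+FFFD so the old
		// field can still be marshaled. This is lossy by design.
		Path: strings.ToValidUTF8(string(rawPath), "\uFFFD"),
		// Copy the bytes so the message does not alias the caller's slice.
		PathBytes: append([]byte(nil), rawPath...),
	}
}

func main() {
	c := newCommitForTree([]byte("caf\xe9.txt"))
	fmt.Printf("old field: %q\n", c.Path)
	fmt.Printf("new field: %q\n", c.PathBytes)
}
```

New clients would read only the bytes field and apply their own encoding handling; old clients keep reading the string field until it is deprecated.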

Edited by Jacob Vosmaer