Discussion: Use gRPC backchannel to invoke access checks in hooks

Many of the mutating RPCs will perform a set of actions before changes are accepted:

The pre- and post-receive hooks invoke Rails by POSTing /internal/pre-receive and /internal/post-receive APIs. This is used to perform reference counting of in-flight changes in Rails.
The pre-, update and post-receive hooks execute custom hooks that may be installed by the administrator.
After the pre-receive and update hooks ran we invoke Rails via a POST to the /internal/allowed API to verify whether the curernt user is allowed to perform changes. In most cases, this will cause Rails to connect back to Gitaly again in order to enumerate things like new objects.

The POSTs to Rails have proven to be problematic and caused incidents several times already. A simplified version looks like the following:

sequenceDiagram

    Rails->>+Gitaly: UserMergeBranch
    Gitaly->>Gitaly: Compute Merge

    Gitaly->>+Rails: POST /internal/pre-receive
    Rails-->>-Gitaly: Success

    rect rgb(240, 240, 240)
        note right of Rails: Access check

        Gitaly->>+Rails: POST /internal/allowed
        Rails->>+Gitaly: ListAllLFSPointers
        Gitaly-->>-Rails: LFSPointers
        Rails-->>-Gitaly: Success
    end

    Gitaly->>+Rails: POST /internal/post-receive
    Rails-->>-Gitaly: Success

    Gitaly-->>-Rails: Success

The RPC interfaces that Gitaly connects back to on the Rails side are only loosely defined. Consequentially, we have inadvertently changed them at times without realizing the impact on callers.
Rails needs to be very careful for how the repository is assembled that gets used when /internal/allowed connects back to Gitaly, because they also contain a quarantine directory that contains all the changes.
In Praefect-based setups, the quarantine directory is specific per proxied Gitaly. Consequentially, it must be ensured that requests go to the same Gitaly node that has POSTed to /internal/allowed. This requires custom routing logic where Rails sets a specific header to force-route to the primary Gitaly node.
We have observed issues where the originating node of a Gitaly RPC call would be in one deployment environment (e.g. staging-cny), whereas the node that Gitaly connects back to would be in a different deployment environment (e.g. staging). This has caused a lot of confusion.
It is hard for both Rails and Gitaly to correlate the original requests with subsequent requests that are triggered after connecting back to the remote server.

Altogether, this whole architecture is complex, fragile and hard to understand. Furthermore, the fact that we POST back to a GitLab-specific RPC means that Gitaly doesn't work well as a standalone product.

It would be great to find an alternative solution. One such idea was to move access checks into Gitaly, but at several points in time we have noted that this would put too much application-level policy into Gitaly. Furthermore, Gitaly typically doesn't have enough informatino in order to decide on the various policies.

An alternative could be to reuse the "backchannel" concept we already have in the context of transactional voting with Praefect. The caller will set up a gRPC server on the original connection to Gitaly (e.g. UserMergeBranch) that Gitaly can connect back to. If Gitaly finds that the server does exist, it can invoke clearly defined RPC calls on that backchannel connection. This has a bunch of benefits:

We avoid the separate connection to Rails via POST calls. We might thus even be able to completely get rid of the connection details for Rails in Gitaly.
It removes the Rails-specific implementation and thus makes Gitaly more self-contained.
It is a lot easier to correlate the different contexts with each other as Gitaly will reuse the same gRPC connection to connect back to the original node.
As the client-side server would be using gRPC as well the calling interface would be clearly defined and well-documented. This helps to avoid the introduction of breaking changes.

The client-side server definition could look similar to the following:

service AccessCheckService {
  rpc NotifyPreReceive(stream NotifyPreReceiveRequest) returns (NotifyPreReceiveResponse);
  rpc NotifyPostReceive(stream NotifyPostReceiveRequest) returns (NotifyPostReceiveResponse);
  rpc CheckAllowed(CheckAllowedRequest) returns (CheckAllowedResponse);
}

message Change {
  bytes reference = 1;
  bytes old_oid = 2;
  bytes new_oid = 3;
};

message NotifyPreReceiveRequest {
  Repository repository = 1;
  repeated Change changes = 2;
};

message NotifyPreReceiveResponse {
};

message NotifyPostReceiveRequest {
  Repository repository = 1;
  repeated Change changes = 2;
};

message NotifyPostReceiveResponse {
};

message CheckAllowedRequest {
  Repository repository = 1;
  repeated Change changes = 2;
};

message CheckAllowedResponse {
};

Edited May 15, 2023 by Patrick Steinhardt

To upload designs, you'll need to enable LFS and have an admin enable hashed storage. More information