Skip to content

Fix praefect sidechannel proxy

For #5552 (closed)

In some CI job failures, we discover a race condition in Praefect. When the upload-pack command returns an error in stderr, a gRPC error is returned and the content of stderr is copied back to the client. Without Praefect, everything is fine. However, when Praefect is enabled, the content of stderr is not flushed before downstream receives the error. As a result, the client sometimes receives an empty response.

The root cause was due to Sidechannel's proxying function returning early without waiting for connection teardown before moving on. This MR fixes that race by always waiting for the connection to close.

How to verify

As this is a race, it's not easy to replicate this behavior reliably without modifying the implementation of the Sidechannel connection. As a result, additional tests cannot verify the race, unfortunately. Instead, they verify whether the change works when upstream returns an error.

We can intentionally modify the implementation on master branch. The following code snippet adds a tiny delay to each write operation. Rerunning Praefect tests failed with expected errors. After this fix, the failures are gone.

diff --git a/internal/grpc/sidechannel/conn.go b/internal/grpc/sidechannel/conn.go
index f77bbcbe3..591a413e5 100644
--- a/internal/grpc/sidechannel/conn.go
+++ b/internal/grpc/sidechannel/conn.go
@@ -4,6 +4,7 @@ import (
        "fmt"
        "io"
        "net"
+       "time"

        "gitlab.com/gitlab-org/gitaly/v16/internal/git/pktline"
        "gitlab.com/gitlab-org/gitaly/v16/streamio"
@@ -102,6 +103,7 @@ func (cc *ServerConn) Read(p []byte) (n int, err error) {
 // Write writes data to the connection. This method fallbacks to underlying
 // connection without any modificiation.
 func (cc *ServerConn) Write(b []byte) (n int, err error) {
+       time.Sleep(1 * time.Millisecond)
        return cc.conn.Write(b)
 }

Screenshot_2023-09-05_at_11.21.06

Merge request reports