Investigate canceled requests that happen in the production environment when gitlab-sshd is enabled
What happened
After our recent rollout, we have data to work with:
- The requests are canceled because SSH connections are closed, and Gitaly logs show that the cancellation has been initiated by gitlab-sshd: https://log.gprd.gitlab.net/goto/bd25a010-d2a1-11ec-aade-19e9974a7229. When a user deliberately cancels a request or the request completes successfully, the reason for their disconnect is `ssh: disconnect, reason 11: disconnected by user`, while most of the canceled requests have `EOF`: https://log.gprd.gitlab.net/goto/965bbee0-d2a3-11ec-a125-c377a8daf518. That means that the cancellation wasn't initiated by a user terminating the operation.
- It doesn't seem to be caused by high load, because the maximum number of concurrent connections is 8 (I performed 100-200 concurrent connections on staging without any significant issues).
- The errors are distributed evenly across all the pods (around 15-20%).
- The duration of the requests is very short, as if the request only connects to Gitaly and the connection is then closed: https://log.gprd.gitlab.net/goto/f1f9e810-d2a0-11ec-aade-19e9974a7229
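
To make the `disconnected by user` vs `EOF` distinction easier to track over time, exported log messages could be bucketed by disconnect reason. The following is only a minimal sketch: the two message strings come from the observations above, but the function names and the sample input are hypothetical, and the real log export may keep the error text in a different field.

```python
from collections import Counter

# Message seen when the client ends the session itself
# (deliberate cancel or normal completion), per the findings above.
USER_DISCONNECT = "ssh: disconnect, reason 11: disconnected by user"


def classify_disconnect(message: str) -> str:
    """Bucket a log message by how the SSH connection ended."""
    if USER_DISCONNECT in message:
        return "client_disconnect"
    # A bare EOF means the connection went away without an SSH
    # disconnect message from the client.
    if message.strip().endswith("EOF"):
        return "eof"
    return "other"


def tally(messages):
    """Count messages per bucket."""
    return Counter(classify_disconnect(m) for m in messages)


if __name__ == "__main__":
    # Hypothetical sample; replace with messages exported from the logs.
    sample = [
        "ssh: disconnect, reason 11: disconnected by user",
        "EOF",
        "EOF",
    ]
    for reason, count in tally(sample).items():
        print(f"{reason}: {count}")
```

A rising share of `eof` relative to `client_disconnect` would be consistent with something other than the user closing the connection.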
Next steps
- Determine whether something between gitlab-sshd and a client may close a connection
- Verify that HAProxy settings for `gstg` are the same as for `gprd` (`gprd-cny`)
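
One way to approach the HAProxy comparison is to diff the `timeout` directives of the two environments, since a shorter timeout on one side could close idle connections early. This is a rough sketch under stated assumptions: the config snippets are made up for illustration, the helper names are hypothetical, and the parser flattens all sections, so a directive repeated in several sections keeps only its last value.

```python
def parse_timeouts(config_text: str) -> dict:
    """Collect `timeout <name> <value>` directives from HAProxy config text.

    Sections are ignored: if the same timeout appears more than once,
    the last occurrence wins.
    """
    timeouts = {}
    for line in config_text.splitlines():
        # Drop trailing comments, then split on whitespace.
        parts = line.split("#", 1)[0].split()
        if len(parts) >= 3 and parts[0] == "timeout":
            timeouts[parts[1]] = parts[2]
    return timeouts


def diff_timeouts(config_a: str, config_b: str) -> dict:
    """Return {name: (value_a, value_b)} for timeouts that differ."""
    a, b = parse_timeouts(config_a), parse_timeouts(config_b)
    return {
        name: (a.get(name), b.get(name))
        for name in sorted(set(a) | set(b))
        if a.get(name) != b.get(name)
    }


if __name__ == "__main__":
    # Hypothetical values, not the real gstg/gprd settings.
    gstg = "defaults\n    timeout client 90s\n    timeout server 90s\n"
    gprd = "defaults\n    timeout client 30s\n    timeout server 90s\n"
    for name, (left, right) in diff_timeouts(gstg, gprd).items():
        print(f"timeout {name}: gstg={left} gprd={right}")
```

An empty result would suggest the timeouts match, in which case other settings (or another hop between the client and gitlab-sshd) would need checking.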