Praefect timing out when creating a GitLab backup
Support Request for the Gitaly Team
The goal is to keep these requests public. However, if customer information is required for the support request, please be sure to mark this issue as confidential.
This request template is part of Gitaly Team's intake process.
Author Checklist
- Reached out to #spt_pod_git prior to creating issue (please provide link)
- Fill out customer information section
- Provide a detailed summary under Additional Information
- Severity realistically set
- Provided detailed problem description
- Provided detailed troubleshooting performed
- Clearly articulated what is needed from the Gitaly team to support your request by filling out the "What specifically do you need from the Gitaly team" section
Customer Information
Salesforce Link: https://gitlab.my.salesforce.com/0014M00001ySom4QAC
Zendesk Ticket: https://gitlab.zendesk.com/agent/tickets/517982
Installation Size: Large
Architecture Information:
Slack Channel:
Additional Information:
Support Request
Severity
Problem Description
We have a customer who upgraded from 15.9.3 to 16.7.7. Since the upgrade, the gitlab-backup operation has not completed successfully on either 16.1.6 or 16.7.7.
They are seeing timeouts when connecting to the Praefect load balancer:
...
@hashed/de/ee/deeeb5df3f2cee6bf4e597a8a3a878a6ce49b932b9e90b416922d4499f54fae6.wiki.git (xxx.wiki): manager: remote repository: object hash: rpc error: code = Unavailable desc = last connection error: connection error: desc = \"transport: Error while dialing: dial tcp 10.30.44.12:2305: i/o timeout\"\n - @hashed/de/ee/deeeb5df3f2cee6bf4e597a8a3a878a6ce49b932b9e90b416922d4499f54fae6.design.git (xxx.design): manager: remote repository: object hash: rpc error: code = Unavailable desc = last connection error: connection error: desc = \"transport: Error while dialing: dial tcp 10.30.44.12:2305: i/o timeout\"\n - @hashed/04/a8/04a8708c3a481ced13845a30de522486895de0592222c29326d9139ec2b9df25.git (xxx): manager: remote repository: object hash: rpc error: code = Unavailable desc = last connection error: connection error: desc = \"transport: Error while dialing: dial tcp 10.30.44.12:2305: i/o timeout\"\n - @hashed/04/a8/04a8708c3a481ced13845a30de522486895de0592222c29326d9139ec2b9df25.wiki.git (xxx.wiki): manager: remote repository: object hash: rpc error: code = Unavailable desc = last connection error: connection error: desc = \"transport: Error while dialing: dial tcp 10.30.44.12:2305: i/o timeout\"\n - @hashed/04/a8/04a8708c3a481ced13845a30de522486895de0592222c29326d9139ec2b9df25.design.git (xxx.design): manager: remote repository: object hash: rpc error: code = Unavailable desc = last connection error: connection error: desc = \"transport: Error while dialing: dial tcp 10.30.44.12:2305: i/o timeout\"\n - @hashed/9a/72/9a72c24f2fd76561729110d804c69f38a7088f2ec41fdf8fbfea20d07e8bcff8.git (xxx): manager: remote repository: object hash: rpc error: code = Unavailable desc = last connection error: connection error: desc = \"transport: Error while dialing: dial tcp 10.30.44.12:2305: i/o timeout\"\n - @hashed/9a/72/9a72c24f2fd76561729110d804c69f38a7088f2ec41fdf8fbfea20d07e8bcff8.wiki.git (xxx.wiki): manager: remote repository: object hash: rpc error: code = Unavailable desc = 
last connection error: connection error: desc = \"transport: Error while dialing: dial tcp 10.30.44.12:2305: i/o timeout\"\n - @hashed/9a/72/9a72c24f2fd76561729110d804c69f38a7088f2ec41fdf8fbfea20d07e8bcff8.design.git (xxx.design): manager: remote repository: object hash: rpc error: code = Unavailable desc = last connection error: connection error: desc = \"transport: Error while dialing: dial tcp 10.30.44.12:2305: i/o timeout\"\n - @hashed/45/95/459535faa370a3b5f8b87203b089623c7aeb9325abf241ec8a685b9c325047a3.git (xxx): manager: remote repository: object hash: rpc error: code = Unavailable desc = last connection error: connection error: desc = \"transport: Error while dialing: dial tcp 10.30.44.12:2305: i/o timeout\"\n - @hashed/45/95/459535faa370a3b5f8b87203b089623c7aeb9325abf241ec8a685b9c325047a3.wiki.git (xxx.wiki): manager: remote repository: object hash: rpc error: code = Unavailable desc = last connection error: connection error: desc = \"transport: Error while dialing: dial tcp 10.30.44.12:2305: i/o timeout\"\n - @hashed/45/95/459535faa370a3b5f8b87203b089623c7aeb9325abf241ec8a685b9c325047a3.design.git (xxx.design): manager: remote repository: object hash: rpc error: code = Unavailable desc = last connection error: connection error: desc = \"transport: Error while dialing: dial tcp 10.30.44.12:2305: i/o timeout\"\n - @hashed/09/a1/09a1b036b82baba3177d83c27c1f7d0beacaac6de1c5fdcc9680c49f638c5fb9.git (xxx): manager: remote repository: object hash: rpc error: code = Unavailable desc = last connection error: connection error: desc = \"transport: Error while dialing: dial tcp 10.30.44.12:2305: i/o timeout\"\n","pid":679,"time":"2024-04-05T18:09:48.777Z"}
...
10.30.44.12:2305 points to the Praefect load balancer.
To rule out issues with the load balancer, we pointed the backup Rails node directly at one of the Praefect nodes. The backup still timed out, this time against that Praefect node's IP address:
{"command":"create","gl_project_path":"xxx","level":"info","msg":"started create","pid":982,"relative_path":"@hashed/37/83/37834f2f25762f23e1f74a531cbe445db73d6765ebe60878a7dfbecd7d4af6e1.git","storage_name":"default","time":"2024-04-22T17:02:37.669Z"}
{"command":"create","error":"manager: write bundle: remote repository: create bundle: rpc error: code = Unavailable desc = error reading from server: read tcp 10.30.44.95:57936-\u003e10.30.44.73:2305: read: connection timed out","gl_project_path":"xxx","level":"error","msg":"create failed","pid":982,"relative_path":"@hashed/37/83/37834f2f25762f23e1f74a531cbe445db73d6765ebe60878a7dfbecd7d4af6e1.git","storage_name":"default","time":"2024-04-22T17:05:52.365Z"}
{"level":"error","msg":"create: pipeline: 1 failures encountered:\n - @hashed/37/83/37834f2f25762f23e1f74a531cbe445db73d6765ebe60878a7dfbecd7d4af6e1.git (xxx): manager: write bundle: remote repository: create bundle: rpc error: code = Unavailable desc = error reading from server: read tcp 10.30.44.95:57936-\u003e10.30.44.73:2305: read: connection timed out\n","pid":982,"time":"2024-04-22T17:13:46.014Z"}
10.30.44.73:2305 is one of the Praefect nodes.
According to the customer, the failure is intermittent: the backup sometimes succeeds and sometimes fails.
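The two excerpts above show different failure shapes: a dial timeout (the TCP connection to 10.30.44.12:2305 was never established) and a read timeout (an established connection to 10.30.44.73:2305 stalled mid-stream). A minimal sketch for tallying both kinds across a full gitaly-backup log follows. The field names (`level`, `error`) are taken from the log lines above; the function names are hypothetical, and counting only entries that carry an `error` field is an assumption made to avoid double counting the pipeline summary line.

```python
import json
import re
from collections import Counter

# Failure shapes seen in the log excerpts:
#   "dial tcp HOST:PORT: i/o timeout"              -> connection never established
#   "read tcp SRC->DST: read: connection timed out" -> established connection stalled
DIAL_RE = re.compile(r"dial tcp (\S+?): i/o timeout")
READ_RE = re.compile(r"read tcp \S+->(\S+?): read: connection timed out")


def classify(error_text):
    """Return a list of (kind, remote_address) pairs found in one error string."""
    found = [("dial_timeout", addr) for addr in DIAL_RE.findall(error_text)]
    found += [("read_timeout", addr) for addr in READ_RE.findall(error_text)]
    return found


def summarize(lines):
    """Count timeout kinds per remote address across JSON log lines."""
    counts = Counter()
    for line in lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip truncated or non-JSON lines
        if not isinstance(entry, dict) or entry.get("level") != "error":
            continue
        # Assumption: per-repository failures carry an "error" field, while the
        # final pipeline summary repeats the same errors inside "msg" only.
        if "error" not in entry:
            continue
        counts.update(classify(entry["error"]))
    return counts
```

Feeding the log file to `summarize` (for example `summarize(open(path))`, where `path` is wherever the backup log was captured) returns a `Counter` keyed by failure kind and remote address, which may help show whether failures cluster on one address or one failure mode.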
- What version is the customer running? GitLab 16.7.7
- What is the customer's architecture? I will share the architecture as an internal note
- What is the GitLab architecture?
- Are networking filesystems (like NFS) used?
- What are the filesystems?
- What are the OS and kernel versions?
- How are backup, replication, HA, etc performed?
- Are they using Gitaly Cluster? Yes
- How many Gitaly Clusters does the customer have? 1 Gitaly cluster
- How many Gitaly nodes per cluster has the customer configured? 3 Gitaly nodes
Troubleshooting Performed
- The customer tried setting the following variables to reduce the load on the Praefect load balancer:
GITLAB_BACKUP_MAX_CONCURRENCY=3 GITLAB_BACKUP_MAX_STORAGE_CONCURRENCY=1
- We pointed the backup Rails node directly at one of the Praefect nodes to rule out issues with the load balancer, because the customer mentioned the following:
What's unique is that the backup will use all of the network bandwidth that the load balancer equipment can accommodate. At this time, I wonder if it is a problem for backup due to packet loss caused by bandwidth excess, so please check if you can put a bandwidth limit on the process of calling the internal loadbalancer in gitlab-backup
- We also made sure that the backup rails node can connect to Praefect:
$ podman exec xxx gitlab-rake gitlab:gitaly:check
Checking Gitaly ...
Gitaly: ... default ... OK
Checking Gitaly ... Finished
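Because the failure is intermittent, a single successful check may not catch it. As a rough sketch, the backup Rails node could repeatedly open plain TCP connections to the Praefect address and record how long each attempt takes. This only exercises the TCP dial, not the gRPC layer that gitaly-backup uses, and the function names, attempt count, interval and timeout below are all illustrative assumptions.

```python
import socket
import time


def probe(host, port, timeout=5.0):
    """Try one TCP connect; return (ok, seconds_elapsed, error_text)."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, time.monotonic() - start, ""
    except OSError as exc:  # covers timeouts and refused connections
        return False, time.monotonic() - start, str(exc)


def probe_repeatedly(host, port, attempts=10, interval=1.0, timeout=5.0):
    """Run several probes, sleeping `interval` seconds between attempts."""
    results = []
    for i in range(attempts):
        results.append(probe(host, port, timeout))
        if i < attempts - 1:
            time.sleep(interval)
    return results
```

Running `probe_repeatedly("10.30.44.12", 2305)` while a backup is in progress, and again while the system is idle, might show whether dial timeouts correlate with backup load on the load balancer.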
What specifically do you need from the Gitaly team
Help identify why the connection is timing out on Praefect when a backup is being taken. As per the customer, this started happening after they upgraded their GitLab instance.
/cc @mjwood @andrashorvath @jcaigitlab @john.mcdonnell @gerardo