Praefect losing connection to Gitaly during hard disk stress test

While testing for gitlab-org/quality/reference-architectures#90, I ran into an issue where 2 of 3 Praefect nodes stopped updating the 'node_status.last_seen_active_at' column for one Gitaly node while I was stress testing a 3 Praefect -> 3 Gitaly setup in Azure. I expected the issue to resolve itself once the test finished and the node_status table to start updating again, but this didn't happen.

Steps that led to issue:

On the 3 gitaly nodes: stress-ng --hdd 0 -t 1200s --aggressive --hdd-bytes 256G
Concurrently: zero-downtime-testing-tool ./zdt-verifier --environment ./configs/customer-25k-azure.yaml --git-loop-delay 5 --readiness-loop-delay 5 --log-dir ./logs

During the test, I noted that after about 10 minutes praefect1 and praefect2 started to report the following error when connecting to gitaly2, while praefect3 had no such issue:

{"component":"HealthManager","error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded","level":"error","msg":"failed checking node health","pid":2278,"storage":"gitaly-2","time":"2021-11-18T03:31:48.822Z","virtual_storage":"default"}
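For anyone trying to reproduce this, a small script can pull the failed health checks out of the Praefect log to see when each storage started timing out. This is only a sketch: the function name `failed_health_checks` is made up for illustration, and it assumes one JSON object per log line with the same fields as the entry above.

```python
import json

# Log entry copied from the Praefect output above.
LOG_LINE = (
    '{"component":"HealthManager","error":"rpc error: code = DeadlineExceeded '
    'desc = context deadline exceeded","level":"error","msg":"failed checking node health",'
    '"pid":2278,"storage":"gitaly-2","time":"2021-11-18T03:31:48.822Z","virtual_storage":"default"}'
)


def failed_health_checks(lines):
    """Return (time, virtual_storage, storage, error) for each failed health check.

    Hypothetical helper: assumes one JSON object per line, shaped like LOG_LINE.
    """
    failures = []
    for line in lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not JSON
        if not isinstance(entry, dict):
            continue  # skip JSON values that are not objects
        if (
            entry.get("component") == "HealthManager"
            and entry.get("msg") == "failed checking node health"
        ):
            failures.append(
                (
                    entry.get("time"),
                    entry.get("virtual_storage"),
                    entry.get("storage"),
                    entry.get("error"),
                )
            )
    return failures


for failure in failed_health_checks([LOG_LINE, "not a json line"]):
    print(failure)
```

Sorting the result by time and grouping by storage should show whether the timeouts stop when the stress test ends or continue afterwards.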

SELECT * FROM node_status ORDER BY praefect_name, node_name;

   1    "gitlab-qa-25k-customer-praefect-1:0.0.0.0:2305"	"default"	"gitaly-1"	"2021-11-18 03:50:42.927176+00"	"2021-11-18 03:50:42.927176+00"
   2    "gitlab-qa-25k-customer-praefect-1:0.0.0.0:2305"	"default"	"gitaly-2"	"2021-11-18 03:50:42.927176+00"	"2021-11-18 03:31:22.317536+00"     -- Lost Connection 10 mins into test
   3    "gitlab-qa-25k-customer-praefect-1:0.0.0.0:2305"	"default"	"gitaly-3"	"2021-11-18 03:50:42.927176+00"	"2021-11-18 03:50:42.927176+00"
1966	"gitlab-qa-25k-customer-praefect-2:0.0.0.0:2305"	"default"	"gitaly-1"	"2021-11-18 03:50:42.755291+00"	"2021-11-18 03:50:42.755291+00"
1967	"gitlab-qa-25k-customer-praefect-2:0.0.0.0:2305"	"default"	"gitaly-2"	"2021-11-18 03:50:42.755291+00"	"2021-11-18 03:31:22.630026+00"     -- Lost Connection 10 mins into test
1968	"gitlab-qa-25k-customer-praefect-2:0.0.0.0:2305"	"default"	"gitaly-3"	"2021-11-18 03:50:42.755291+00"	"2021-11-18 03:50:42.755291+00"
1969	"gitlab-qa-25k-customer-praefect-3:0.0.0.0:2305"	"default"	"gitaly-1"	"2021-11-18 03:50:43.817782+00"	"2021-11-18 03:50:43.817782+00"
1970	"gitlab-qa-25k-customer-praefect-3:0.0.0.0:2305"	"default"	"gitaly-2"	"2021-11-18 03:50:43.817782+00"	"2021-11-18 03:50:43.817782+00"     -- This praefect has no issues
1971	"gitlab-qa-25k-customer-praefect-3:0.0.0.0:2305"	"default"	"gitaly-3"	"2021-11-18 03:50:43.817782+00"	"2021-11-18 03:50:43.817782+00"

SELECT * FROM healthy_storages;

"default"	"gitaly-1"
"default"	"gitaly-3"
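The healthy_storages result is consistent with a majority rule over the node_status rows: gitaly-2 was recently seen by only 1 of 3 Praefect nodes. The sketch below reproduces that reasoning in Python. It is an illustration under assumptions, not the actual view definition: the 60 second and 10 second windows, the field order, and the shortened Praefect names are all assumed for the example, and the snapshot time is approximated by the latest contact attempt.

```python
from datetime import datetime, timedelta
from math import ceil

# (praefect, virtual_storage, storage, last contact attempt, last seen active)
# Timestamps copied from the node_status output above; Praefect names shortened.
ROWS = [
    ("praefect-1", "default", "gitaly-1", "2021-11-18 03:50:42.927176+00", "2021-11-18 03:50:42.927176+00"),
    ("praefect-1", "default", "gitaly-2", "2021-11-18 03:50:42.927176+00", "2021-11-18 03:31:22.317536+00"),
    ("praefect-1", "default", "gitaly-3", "2021-11-18 03:50:42.927176+00", "2021-11-18 03:50:42.927176+00"),
    ("praefect-2", "default", "gitaly-1", "2021-11-18 03:50:42.755291+00", "2021-11-18 03:50:42.755291+00"),
    ("praefect-2", "default", "gitaly-2", "2021-11-18 03:50:42.755291+00", "2021-11-18 03:31:22.630026+00"),
    ("praefect-2", "default", "gitaly-3", "2021-11-18 03:50:42.755291+00", "2021-11-18 03:50:42.755291+00"),
    ("praefect-3", "default", "gitaly-1", "2021-11-18 03:50:43.817782+00", "2021-11-18 03:50:43.817782+00"),
    ("praefect-3", "default", "gitaly-2", "2021-11-18 03:50:43.817782+00", "2021-11-18 03:50:43.817782+00"),
    ("praefect-3", "default", "gitaly-3", "2021-11-18 03:50:43.817782+00", "2021-11-18 03:50:43.817782+00"),
]


def parse(ts):
    # PostgreSQL prints the offset as "+00"; pad it to "+00:00" for fromisoformat.
    return datetime.fromisoformat(ts + ":00")


def healthy_storages(rows, now,
                     active_window=timedelta(seconds=60),  # assumed threshold
                     seen_window=timedelta(seconds=10)):   # assumed threshold
    """Storages recently seen active by at least half of the active Praefects."""
    active_praefects = {}  # virtual_storage -> Praefects that attempted contact recently
    seen_by = {}           # (virtual_storage, storage) -> Praefects that saw it recently
    for praefect, virtual_storage, storage, attempt, seen in rows:
        if now - parse(attempt) <= active_window:
            active_praefects.setdefault(virtual_storage, set()).add(praefect)
        if now - parse(seen) <= seen_window:
            seen_by.setdefault((virtual_storage, storage), set()).add(praefect)
    healthy = []
    for (virtual_storage, storage), praefects in seen_by.items():
        quorum = ceil(len(active_praefects.get(virtual_storage, ())) / 2)
        if len(praefects) >= quorum:
            healthy.append((virtual_storage, storage))
    return sorted(healthy)


# Approximate the snapshot time with the most recent contact attempt.
NOW = max(parse(row[3]) for row in ROWS)
print(healthy_storages(ROWS, NOW))
```

With the rows above, gitaly-1 and gitaly-3 are seen by all three Praefect nodes and pass the quorum of 2, while gitaly-2 is seen only by praefect-3 and is dropped, which matches the query output.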

On all 3 Praefect nodes, praefect dial-nodes reported success, so I would have thought that the node_status table should also be updated.

sudo /opt/gitlab/embedded/bin/praefect -config /var/opt/gitlab/praefect/config.toml dial-nodes

2021/11/18 06:18:53 [tcp://172.17.0.23:8075]: SUCCESS: confirmed Gitaly storage "gitaly-1" in virtual storages [default] is served
2021/11/18 06:18:53 [tcp://172.17.0.19:8075]: SUCCESS: confirmed Gitaly storage "gitaly-3" in virtual storages [default] is served
2021/11/18 06:18:53 [tcp://172.17.0.32:8075]: SUCCESS: confirmed Gitaly storage "gitaly-2" in virtual storages [default] is served
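To make the mismatch explicit, the dial-nodes output can be compared against the healthy_storages rows. The snippet below is a rough sketch: the regular expression is written against the three SUCCESS lines above only, and `unhealthy_but_reachable` is a made-up helper name.

```python
import re

# dial-nodes output and healthy_storages rows copied from this issue.
DIAL_NODES_OUTPUT = """\
2021/11/18 06:18:53 [tcp://172.17.0.23:8075]: SUCCESS: confirmed Gitaly storage "gitaly-1" in virtual storages [default] is served
2021/11/18 06:18:53 [tcp://172.17.0.19:8075]: SUCCESS: confirmed Gitaly storage "gitaly-3" in virtual storages [default] is served
2021/11/18 06:18:53 [tcp://172.17.0.32:8075]: SUCCESS: confirmed Gitaly storage "gitaly-2" in virtual storages [default] is served
"""
HEALTHY_STORAGES = [("default", "gitaly-1"), ("default", "gitaly-3")]

# Pattern derived from the sample lines above; other message variants are not handled.
SUCCESS_RE = re.compile(r'SUCCESS: confirmed Gitaly storage "([^"]+)"')


def unhealthy_but_reachable(dial_output, healthy):
    """Storages that dial-nodes confirms but healthy_storages does not list."""
    reachable = set(SUCCESS_RE.findall(dial_output))
    listed = {storage for _virtual_storage, storage in healthy}
    return sorted(reachable - listed)


print(unhealthy_but_reachable(DIAL_NODES_OUTPUT, HEALTHY_STORAGES))
```

For the data in this issue the only storage in that gap is gitaly-2: it is reachable from every Praefect node but still missing from healthy_storages.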

Restarting Praefect using gitlab-ctl restart praefect resolved the issue.

Edited Nov 18, 2021 by John McDonnell
