Praefect losing connection to gitaly during hard disk stress test
While testing for gitlab-org/quality/reference-architectures#90 (closed), I ran into an issue where 2 of the 3 praefect nodes stopped updating the 'node_status.last_seen_active_at' column while I was stress testing a 3 praefect -> 3 gitaly setup in Azure. Once the test finished, I expected the issue to resolve itself and the node_status table to start updating again, but this didn't happen.
Steps that led to the issue:
- On the 3 gitaly nodes:
stress-ng --hdd 0 -t 1200s --aggressive --hdd-bytes 256G
- Concurrently: zero-downtime-testing-tool
./zdt-verifier --environment ./configs/customer-25k-azure.yaml --git-loop-delay 5 --readiness-loop-delay 5 --log-dir ./logs
During the test I noted that after about 10 minutes praefect1 and praefect2 started reporting the following error when connecting to gitaly2:
{"component":"HealthManager","error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded","level":"error","msg":"failed checking node health","pid":2278,"storage":"gitaly-2","time":"2021-11-18T03:31:48.822Z","virtual_storage":"default"}
praefect3 had no such issue.
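To see which storages are affected across a longer Praefect log, the health check errors can be counted per storage. This is a minimal Python sketch, assuming the log has one JSON object per line shaped like the sample above; the function name `failed_health_checks` is hypothetical and not part of Praefect.

```python
import json
from collections import Counter


def failed_health_checks(lines):
    """Count 'failed checking node health' errors per storage name."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not JSON
        if not isinstance(entry, dict):
            continue
        # Field names and values are taken from the sample log line in this report.
        if (entry.get("component") == "HealthManager"
                and entry.get("msg") == "failed checking node health"):
            counts[entry.get("storage")] += 1
    return counts


# The log line reported by praefect1 and praefect2.
sample = (
    '{"component":"HealthManager","error":"rpc error: code = DeadlineExceeded '
    'desc = context deadline exceeded","level":"error",'
    '"msg":"failed checking node health","pid":2278,"storage":"gitaly-2",'
    '"time":"2021-11-18T03:31:48.822Z","virtual_storage":"default"}'
)
print(failed_health_checks([sample]))
```

Feeding it the lines of a Praefect log file (for example `failed_health_checks(open(path))`, where `path` is wherever the log lives on the node) gives one count per storage.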
SELECT * FROM node_status ORDER BY praefect_name, node_name;
1 "gitlab-qa-25k-customer-praefect-1:0.0.0.0:2305" "default" "gitaly-1" "2021-11-18 03:50:42.927176+00" "2021-11-18 03:50:42.927176+00"
2 "gitlab-qa-25k-customer-praefect-1:0.0.0.0:2305" "default" "gitaly-2" "2021-11-18 03:50:42.927176+00" "2021-11-18 03:31:22.317536+00" -- Lost Connection 10 mins into test
3 "gitlab-qa-25k-customer-praefect-1:0.0.0.0:2305" "default" "gitaly-3" "2021-11-18 03:50:42.927176+00" "2021-11-18 03:50:42.927176+00"
1966 "gitlab-qa-25k-customer-praefect-2:0.0.0.0:2305" "default" "gitaly-1" "2021-11-18 03:50:42.755291+00" "2021-11-18 03:50:42.755291+00"
1967 "gitlab-qa-25k-customer-praefect-2:0.0.0.0:2305" "default" "gitaly-2" "2021-11-18 03:50:42.755291+00" "2021-11-18 03:31:22.630026+00" -- Lost Connection 10 mins into test
1968 "gitlab-qa-25k-customer-praefect-2:0.0.0.0:2305" "default" "gitaly-3" "2021-11-18 03:50:42.755291+00" "2021-11-18 03:50:42.755291+00"
1969 "gitlab-qa-25k-customer-praefect-3:0.0.0.0:2305" "default" "gitaly-1" "2021-11-18 03:50:43.817782+00" "2021-11-18 03:50:43.817782+00"
1970 "gitlab-qa-25k-customer-praefect-3:0.0.0.0:2305" "default" "gitaly-2" "2021-11-18 03:50:43.817782+00" "2021-11-18 03:50:43.817782+00" -- This praefect has no issues
1971 "gitlab-qa-25k-customer-praefect-3:0.0.0.0:2305" "default" "gitaly-3" "2021-11-18 03:50:43.817782+00" "2021-11-18 03:50:43.817782+00"
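The stale rows can also be picked out programmatically. The sketch below is illustrative only: it assumes the first timestamp in each row is the last contact attempt and the second is `last_seen_active_at`, shortens the Praefect names, and uses a hypothetical 10-second threshold that is not taken from the Praefect configuration.

```python
from datetime import datetime, timedelta

# (praefect, storage, last contact attempt, last seen active), copied from the
# rows above. Praefect names are shortened; the column meaning is an assumption.
ROWS = [
    ("praefect-1", "gitaly-1", "2021-11-18 03:50:42.927176+00", "2021-11-18 03:50:42.927176+00"),
    ("praefect-1", "gitaly-2", "2021-11-18 03:50:42.927176+00", "2021-11-18 03:31:22.317536+00"),
    ("praefect-1", "gitaly-3", "2021-11-18 03:50:42.927176+00", "2021-11-18 03:50:42.927176+00"),
    ("praefect-2", "gitaly-1", "2021-11-18 03:50:42.755291+00", "2021-11-18 03:50:42.755291+00"),
    ("praefect-2", "gitaly-2", "2021-11-18 03:50:42.755291+00", "2021-11-18 03:31:22.630026+00"),
    ("praefect-2", "gitaly-3", "2021-11-18 03:50:42.755291+00", "2021-11-18 03:50:42.755291+00"),
    ("praefect-3", "gitaly-1", "2021-11-18 03:50:43.817782+00", "2021-11-18 03:50:43.817782+00"),
    ("praefect-3", "gitaly-2", "2021-11-18 03:50:43.817782+00", "2021-11-18 03:50:43.817782+00"),
    ("praefect-3", "gitaly-3", "2021-11-18 03:50:43.817782+00", "2021-11-18 03:50:43.817782+00"),
]


def parse_ts(text):
    # PostgreSQL prints the UTC offset as "+00"; pad it to "+0000" for %z.
    return datetime.strptime(text + "00", "%Y-%m-%d %H:%M:%S.%f%z")


def stale_pairs(rows, threshold=timedelta(seconds=10)):
    """Return (praefect, storage, gap) where the two timestamps differ by more than threshold."""
    result = []
    for praefect, storage, attempted, seen_active in rows:
        gap = parse_ts(attempted) - parse_ts(seen_active)
        if gap > threshold:
            result.append((praefect, storage, gap))
    return result


for praefect, storage, gap in stale_pairs(ROWS):
    print(praefect, storage, gap)
```

On the rows in this report, only the gitaly-2 entries for praefect-1 and praefect-2 are flagged, each with a gap of a little over 19 minutes.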
SELECT * FROM healthy_storages;
"default" "gitaly-1"
"default" "gitaly-3"
On all 3 praefect nodes, praefect dial-nodes reported success, so I would have thought that the node_status table should also be updated.
sudo /opt/gitlab/embedded/bin/praefect -config /var/opt/gitlab/praefect/config.toml dial-nodes
2021/11/18 06:18:53 [tcp://172.17.0.23:8075]: SUCCESS: confirmed Gitaly storage "gitaly-1" in virtual storages [default] is served
2021/11/18 06:18:53 [tcp://172.17.0.19:8075]: SUCCESS: confirmed Gitaly storage "gitaly-3" in virtual storages [default] is served
2021/11/18 06:18:53 [tcp://172.17.0.32:8075]: SUCCESS: confirmed Gitaly storage "gitaly-2" in virtual storages [default] is served
Restarting praefect using gitlab-ctl restart praefect resolved the issue.