Praefect: can't start if it is unable to dial one of the Gitaly nodes
During the demo we observed Praefect restarting continuously. The scenario was as follows:
- Setup was done according to the demo doc.
- The cluster was in a working state: reads, writes, and replication worked properly.
- One of the Gitaly nodes was stopped.
- The Praefect node was restarted.
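Praefect was not able to start normally because the dial to the stopped Gitaly node failed.
Most likely this is because NewManager dials all the nodes at startup to cache the connections, so a single unreachable node makes startup fail. In the log, the dial times out about 10 seconds after startup and is logged at fatal level: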
{"level":"info","msg":"Wrapper started","time":"2020-06-05T12:19:50Z","wrapper":6315}
{"level":"info","msg":"finding gitaly","pid_file":"/var/opt/gitlab/praefect/praefect.pid","time":"2020-06-05T12:19:50Z","wrapper":6315}
{"level":"info","msg":"spawning a process","time":"2020-06-05T12:19:50Z","wrapper":6315}
{"gitaly":6321,"level":"info","msg":"monitoring gitaly","time":"2020-06-05T12:19:50Z","wrapper":6315}
{"level":"info","msg":"Starting praefect","pid":6321,"time":"2020-06-05T12:19:50.57Z","version":"Praefect, version 13.1.0-rc2-46-g8f85c2d4"}
{"address":"0.0.0.0:2305","level":"info","msg":"listening at tcp address","pid":6321,"time":"2020-06-05T12:19:50.57Z"}
{"level":"fatal","msg":"failed to dial \"10.150.0.26:8075\" connection: context deadline exceeded","pid":6321,"time":"2020-06-05T12:20:00.577Z"}
{"gitaly":6321,"level":"warning","msg":"forwarding signal","signal":17,"time":"2020-06-05T12:20:00Z","wrapper":6315}
{"error":"os: process already finished","gitaly":6321,"level":"error","msg":"can't forward the signal","signal":17,"time":"2020-06-05T12:20:00Z","wrapper":6315}
{"gitaly":6321,"level":"error","msg":"wrapper for gitaly shutting down","time":"2020-06-05T12:20:01Z","wrapper":6315}
/cc @zj-gitlab @jramsay