Skip to content

Leader wait from 1/2 to 9/10

Multipartite requested to merge Multi/nine-tenths-leader-wait into master

Please see !235 (comment 1161349435) , regarding !225 (merged)
'Leader reports peer timeout before peers' leader-wait timeout', which closed #110 (closed)
'set_tss_pool ID disagreement when a peer is offline, from leader-waiting-for-peers and peers-waiting-for-leader simultaneous timeout'.

Relevant I believe:
https://gitlab.com/thorchain/thornode/-/blob/v1.108.3/cmd/bifrost/main.go#L133 PartyTimeout: 45 * time.Second,

To summarise (with keygens as an example):

Previously when a keygen failed due to a node being offline,
half the nodes might blame the responsible node and the leader, while half the nodes might blame only the leader.
This was because the member nodes stop waiting for the leader at the same time that the leader stops waiting for the responsible member,
so some nodes blame the leader before hearing from the leader,
and so no 2/3rds consensus is reached (the responsible node goes unblamed).

After changing the leader's wait to half the members' wait, the leader (I believe) waits 22.5 seconds instead of 45 seconds.
This avoids the above situation nicely, but occasionally a single node will be a few blocks later than others and the keygen will fail, blaming that node.

This MR aims to move the leader's wait back in the direction of what it was originally,
narrowing the time for the members to receive its information to 4.5 seconds (less than the timespan of one block).
Members would then have 41.5 seconds for key generation and leader-contacting, rather than 22.5 seconds.
(Longer by 19 seconds, over 3 block timespans).

Other wait proportions could still be tried in future
(such as 19/20 rather than 9/10, or a smaller-than-9/10 ratio if the original problem occurs again),
or an absolute offset in seconds rather than a multiplicative proportion,
but my hope is that as 9/10 is a proportion between 10/10 and 5/10 it will have a lower chance of being reverted to 10/10 when unrelated Go-TSS updates are tested.

As always, feedback is welcome.


Historical context (of the line in joinPartyLeader):

https://gitlab.com/thorchain/tss/go-tss/-/blame/v1.5.2/p2p/party_coordinator.go#L441
case <-time.After(pc.timeout):
(dating back to 2020-09-29, 'leader join party v1')

https://gitlab.com/thorchain/tss/go-tss/-/blame/v1.5.3/p2p/party_coordinator.go#L441
https://gitlab.com/thorchain/tss/go-tss/-/blame/v1.5.5/p2p/party_coordinator.go#L441
case <-time.After(pc.timeout / 2):
(2022-09-13, !225 (merged))

https://gitlab.com/thorchain/tss/go-tss/-/blame/v1.5.6/p2p/party_coordinator.go#L441
https://gitlab.com/thorchain/tss/go-tss/-/blame/v1.5.9/p2p/party_coordinator.go#L441
case <-time.After(pc.timeout):
(2022-11-03, '[ADD] apply appropriate resource limit to libp2p' (!235 (merged)))

https://gitlab.com/thorchain/tss/go-tss/-/blame/v1.6.0/p2p/party_coordinator.go#L441
case <-time.After(pc.timeout / 2):
(2023-04-19, 'Revert "Update libp2p to 0.26.3"' (!242 (merged)))

The THORNode versions which have used different Go-TSS versions can be seen from
https://gitlab.com/thorchain/thornode/-/blame/v1.108.3/go.mod#L52
, changing the version in the URL while noting the the Go-TSS version changed from by the most recent commit for that line.
My current understanding is that twice it has been briefly reverted simultaneously with a libp2p update,
then restored as that update was itself reverted.

Edited by Multipartite

Merge request reports