set_tss_pool ID disagreement when a peer is offline, from leader-waiting-for-peers and peers-waiting-for-leader simultaneous timeout
Example:
Block 6415584, containing 32 set_tss_pool messages.
IDs:
26ed3bf361f7b93d03abf18e7237e25f3aebc2f21ec61bde8aac110635315315 (2/17 messages blaming pubkey .c4kj)
3f7c92c46ead944ed3d6863145d9f94e214c51a8c4545962ad1a134f710804c8 (15/17 messages blaming pubkeys .c4kj and .rp5g)
033adb951a0bb5348cd648d8cbbc5905c00e025a7e7eb063c18c04e5796e5c55 (8 messages blaming pubkey .067w)
b9fd89f197692516c21a5a8709f483da74ffc273786d4cda3012a544053986e6 (7 messages blaming pubkeys .067w, .5urc, and .xm86)
Code:
https://gitlab.com/thorchain/thornode/-/blob/v1.93.0/cmd/bifrost/main.go#L131-133
KeyGenTimeout: 300 * time.Second, // must be shorter than constants.JailTimeKeygen
PartyTimeout: 45 * time.Second,
https://gitlab.com/thorchain/tss/go-tss/-/blob/v1.5.1/tss/tss.go#L104
pc := p2p.NewPartyCoordinator(comm.GetHost(), conf.PartyTimeout)
joinPartyMember
:
https://gitlab.com/thorchain/tss/go-tss/-/blob/v1.5.1/p2p/party_coordinator.go#L371-375
case <-time.After(pc.timeout):
// timeout
close(done)
pc.logger.Error().Msg("the leader has not reply us")
return
joinPartyLeader
:
https://gitlab.com/thorchain/tss/go-tss/-/blob/v1.5.1/p2p/party_coordinator.go#L441-444
case <-time.After(pc.timeout):
// timeout
pc.logger.Error().Msg("leader waits for peers timeout")
return
Summary:
If the leader and online peers receive the keygen block at the same time,
the leader will wait 45 seconds before blaming an offline peer and informing the other peers,
while the online peers will wait 45 seconds before they stop waiting for the leader's report and blame the leader.
As a result, it is random whether a peer times out right before or right after the leader's report,
and if there isn't 2/3rds ID consensus then there 'fail keygen' slash points and jailing will not be applied.
To keep PartyTimeout
as the full time for the party, I currently propose that joinPartyLeader
alone have
case <-time.After(pc.timeout / 2):
,
such that peers have half the PartyTimeout
time to check in with the leader and the leader has half the PartyTimeout
time to report to the peers.
(I would also like to consider in a separate Issue or/and Merge Request the !182 (merged) "it is a bit unfair" concept;
specifically, whether the code could be changed such that the leader never blames itself, and peers only blame the leader if not having received a report from the leader.
That is to say, only blaming/slashing/jailing the actual offline nodes.)
Edit: For a more recent example, the 2022-07-28 churn had one retry. The first churn try had keygen block 6622228, and the successful churn was on block 6622955.
https://thornode.ninerealms.com/txs?limit=100&message.action=set_tss_pool&tx.height=6622237
5f48f2b46916539016fad734e6ea3b99c47a7e4e552a85c7b2b7decb5070d7a9
9/17, blames pub_key .65xk
https://thornode.ninerealms.com/txs?limit=100&message.action=set_tss_pool&tx.height=6622238
5f48f2b46916539016fad734e6ea3b99c47a7e4e552a85c7b2b7decb5070d7a9
3/17, blames pub_key .65xk [2/3rds threshold]
d162f165970c403d9ff3225e40023ea7dc30b4f7a8a61835b79264786be12317
5/17, blames pub_keys .65xk (the leader) and .gftv, which were the node addresses .9nzt and .7af2 respectively.
https://thornode.ninerealms.com/thorchain/node/thor19sgevjeyx033qrsvxugymevw0v87cf78p29nzt?height=6622238
https://thornode.ninerealms.com/thorchain/node/thor1raylctzthcvjc0a5pv5ckzjr3rgxk5qcwu7af2?height=6622238
During the churn try, both were 'Ready' non-Active nodes.
https://thornode.ninerealms.com/thorchain/node/thor19sgevjeyx033qrsvxugymevw0v87cf78p29nzt?height=6622955
https://thornode.ninerealms.com/thorchain/node/thor1raylctzthcvjc0a5pv5ckzjr3rgxk5qcwu7af2?height=6622955
.7af2 is the one that did not respond to the leader, but the leader was jailed and remained Standby while .7af2 churned in on the retry and became Active.
Second edit: I was previously inclined to have the leader never blame itself (and those who accept the leader's information about who to blame also not blaming the leader), but I am now inclined to leave it as it is: the position of leader (for a given keygen or keysign) is somewhat to represent the group (suffering slash points should the group fail its keygen or keysign), and to not be capable of falsifying a member's lack of participation without itself incurring an equal-in-magnitude penalty.