Missing block signatures currently do not increment slash points
Background summary (clearer explanation welcome): \
If I understand correctly, THORChain makes new blocks when one validator (Active node) proposes a block and 2/3rds of all validators agree to it.
If more than 1/3rd of all nodes were to stop taking part in the block-making process,
THORChain itself would stop producing blocks;
this is the same thing that happens if there is a consensus failure in which two groups each over 1/3rd arrive at two incompatible blocks (due to running different code, and thus each group unable to validate a block from the other group).
The 'signatures' field at
https://thornode-v1.ninerealms.com/blocks/latest
(there for the most recent block)
contains a list of items, one for each validator,
with a unique validator_address
and signature
.
However, sometimes a node (due to technical problems) stops signing blocks
(taking part in the block-making process)
and an item's signature will be null
instead.
As with #1454 (closed)/!2722 (merged) (a similar Issue for broadcast messages),
is is undesirable if there are no consequences for ceasing to participate in aspects of THORChain's operation,
particularly for the vital activity of block-making which underpins everything that takes place in those blocks.
As an incentive, then, this Issue proposes that Active nodes which haven't taken part in signing a block
(that block having been accepted by 2/3rds of all Active nodes)
incur a slash point (or slash points) for this, effectively losing one block's worth of rewards for each slash point incurred.
The ideal is perhaps that a node that went offline would
(from a combination of unbroadcast messages and unsigned blocks)
gradually lose rewards already accrued, so that a node could not accrue some rewards and then go offline for a while with no consequence other than opportunity cost.
(That is, a strong incentive to maintain sufficient functionality, while not involving bond slashes (which would be reducing bonded RUNE rather than RUNE to be received upon the next churn.)
[This Issue largely serves as a record of my thought process; discussion is welcome.]
(Relatedly, please note this discussion thread for #1691, as well as its !3163 (merged).)
Example:
https://thornode-v1.ninerealms.com/blocks/12500000
In .block.last_commit.signatures, there are 89 items total,
of which 88 have block_id_flag 2 and a non-""
validator_address,
while one item (the 6th) has block_id_flag 1, a ""
validator_address, and a null
signature.
https://thornode-v1.ninerealms.com/thorchain/nodes?height=12500000
There are exactly 89 "Active" nodes in that block.
As a continued example, the first listed (.z4lt) indicates ip_address 68.183.244.113 .
68.183.244.113:27147/status
.result.validator_info.address is A0F287976D0D8BC8A156A149E95C33E090009B71,
which is in the 51st signatures item in the /blocks/ endpoint above,
indicating that .z4lt is not the node which failed to sign that block.
The Slasher Manager for HandleDoubleSign
( https://gitlab.com/thorchain/thornode/-/blob/v1.123.0/x/thorchain/manager_slasher_current.go#L38-46 )
uses the abci.RequestBeginBlock's ByzantineValidators evidence information to check for double signs.
However, this evidence itself does not reflect the signatures, only specifying double-signs.
https://docs.cosmos.network/v0.45/modules/evidence/06_begin_block.html#equivocation
Relevant I believe:
https://gitlab.com/thorchain/thornode/-/blob/v1.123.0/third_party/proto/tendermint/abci/types.proto#L354-363
enum EvidenceType {
UNKNOWN = 0;
DUPLICATE_VOTE = 1;
LIGHT_CLIENT_ATTACK = 2;
}
message Evidence {
EvidenceType type = 1;
// The offending validator
Validator validator = 2 [(gogoproto.nullable) = false];
Possibly also this:
https://gitlab.com/thorchain/thornode/-/blob/v1.123.0/third_party/proto/tendermint/types/evidence.proto#L21-25
Also relevant, in the /blocks/ endpoint above, .block.evidence.evidence is an empty array.
However:
message RequestBeginBlock {
bytes hash = 1;
tendermint.types.Header header = 2 [(gogoproto.nullable) = false];
LastCommitInfo last_commit_info = 3 [(gogoproto.nullable) = false];
repeated Evidence byzantine_validators = 4 [(gogoproto.nullable) = false];
}
message LastCommitInfo {
int32 round = 1;
repeated VoteInfo votes = 2 [(gogoproto.nullable) = false];
}
// VoteInfo
message VoteInfo {
Validator validator = 1 [(gogoproto.nullable) = false];
bool signed_last_block = 2;
}
// Validator
message Validator {
bytes address = 1; // The first 20 bytes of SHA256(public key)
// PubKey pub_key = 2 [(gogoproto.nullable)=false];
int64 power = 3; // The voting power
}
(That Validator is plausibly the same Validator type as in the ByzantineValidators Evidence.)
Rather than the RequestBeginBlock's ByzantineValidators field, then,
the RequestBeginBlock's LastCommitInfo field looks usable for checking which nodes didn't vote for the last block.
Something to consider is what happens when the Active nodes change (when there's a churn block).
The node UpdateStatus
calls are in the Validator Manager's EndBlock
,
so not being convenient for passing the information to the RequestBeginBlock's BeginBlock
in the next block.
If checking against the current Active nodes, then newly-Active nodes could receive slash points for blocks in which they were Ready and not Active.
If checking against the previous block's Active nodes, then nodes trying to start from a pruned snapshot could not retrieve that information and be unable to sync.
While a small slash on each churn might not be bothersome,
for now I suggest instead that if a LastCommitInfo signature contains a node which is not currently an Active node,
the previous block is assumed to be a churn block, with no missing-vote slashing carried out.
|
(This is particularly if there is always at least one Active node set to Standby in a churn block,
though there is an edge case of if the Active node/s being churned out had not voted for the churn block,
for example due to being offline.)
|
A similar approach is to compare the number of signatures (of which some might be null
) with the number of validators,
again being ineffective when the number of nodes before and after the churn is the same (which is fairly common).
|
Another measure (plausibly improving most blocks' checking-performance as well!)
could be to not check the node addresses at all if there were no null
signatures (block_id_flag 1).
The above connected paragraphs may in fact be moot; since VoteInfo does not include block_id_flag, but rather contains signed_last_block,
it could be that the LastCommitInfo Votes field contains identification of every validator for that block
(not present in the /blocks/ endpoint)
, and so the only action needed for slashing being to check the validators for which signed_lsat_block is false
,
slashing them if they are currently an Active node (a validator)
and ignoring them if they are not (as they are now Standby and unaffected by slash points).
HandleDoubleSign
already shows one way of associating the validator information with the node account:
https://gitlab.com/thorchain/thornode/-/blob/v1.123.0/x/thorchain/manager_slasher_current.go#L65-76
nas, err := s.keeper.ListActiveValidators(ctx)
if err != nil {
return err
}
for _, na := range nas {
pk, err := cosmos.GetPubKeyFromBech32(cosmos.Bech32PubKeyTypeConsPub, na.ValidatorConsPubKey)
if err != nil {
return err
}
if addr.String() == pk.Address().String() {
It would however be gratifying if GetNodeAccount
were usable to get the node account without needing a range.
Perhaps for that GetNodeAccountByPubKey
is usable.
https://gitlab.com/thorchain/thornode/-/blob/v1.123.0/x/thorchain/keeper/v1/keeper_node_account.go#L179-186
// GetNodeAccountByPubKey try to get node account with the given pubkey from db
func (k KVStore) GetNodeAccountByPubKey(ctx cosmos.Context, pk common.PubKey) (NodeAccount, error) {
addr, err := pk.GetThorAddress()
if err != nil {
return NodeAccount{}, err
}
return k.GetNodeAccount(ctx, addr)
}
Referring to
https://thornode-v1.ninerealms.com/thorchain/node/thor10czf2s89h79fsjmqqck85cdqeq536hw5ngz4lt?height=12500000
which above was noted to correspond to validator address A0F287976D0D8BC8A156A149E95C33E090009B71,
the .validator_cons_pub_key is thorcpub1zcjduepqu9yfgcfg0fq2yx6aphzhzvd7ke046nmw3xhg8vw736exj48wnk8sd5wkk0,
and the .pub_key_set.secp256k1 is thorpub1addwnpepqtxghkykhrwzwww30lvvv5wve68lpcert5wsz3lmqvdxxhnaylelg953pg0.
|
GetNodeAccountByPubKey
I believe is used with that .pub_key_set.secp256k1 pubkey.
|
A question then is whether the above
cosmos.GetPubKeyFromBech32(cosmos.Bech32PubKeyTypeConsPub, na.ValidatorConsPubKey)
produces the .pub_key_set.secp256k1 from the .validator_cons_pub_key,
or whether it produces something completely different.
|
If the Validator address is (as indicated by an above code comment) "The first 20 bytes of SHA256(public key)",
it might in practice not be possible to derive the node's full pubkey from the validator address (thus indeed requiring a range),
though there could be a way I'm not aware of to go directly from the validator address to the node address
(but only if the node address can be derived from those first 20 bytes).
|
The easiest approach is perhaps then to use the same range approach as HandleDoubleSign
,
and for performance relying on that the Active nodes are not checked unless there are nodes which did not vote;
ideally the list of Active node pk.Address().String()
would only be derived once and reused for all missing votes.
In Mainnet code, added code for this slashing would be in a new-version Slasher Manager.
For testing, however, the code could be added to the existing-version Slasher Manager, then sync attempted from a (snapshot) set historical height,
confirming that there was no sync failure for blocks with no missing signatures,
and then a sync failure after a block with a missing signature
(due to slashing the node or nodes which didn't vote on that previous block, perhaps with a log of which nodes were slashed).
My specific code proposal at this time: !3196 (merged)
'LackOfObservationPenalty
for missing block signatures'.