I'm putting my initial thoughts (including concurrency implications) in a collapsible session at the bottom
as I now think this Issue's premise has a misstep:
RetiringVaults can't become InactiveVaults while they have an outbound,
because they can only become InactiveVaults after they've sent out all their funds.
[Second edit: Looking closer, these two examples aren't even the below--they appear to be true InactiveVault inbounds that started confirmation counting more than two days after the vaults became InactiveVaults,
not during-churn inbounds.]
|
This instead sounds like the existing #1820 (closed)
'Inbounds prematurely increase THORNode vault balances before confirmation counting'
(with an example of swallowed LTC worth ~150,000 RUNE)
|
in which the pre-confirmation inbound increases the vault balance,
THORChain tries (and succeeds) to migrate the inbound it isn't sure has happened yet,
the no-funds vault becomes an InactiveVault,
the confirmation count finishes and THORChain is now sure the inbound has happened,
since it's an InactiveVault it tries to refund it
(rather than attempting a gas-asset-only gas-asset-costing migrate to an ActiveVault and then acting on the memo with reduced Coins, logistically painful)
and the refund fails because the vault doesn't have the funds any more.
|
(I note 2024-01-07's unmerged !3351 (merged) 'Increase THORNode vault balance only after inbound confirmation counting'.)
This was a true after-churn InactiveVault inbound and refund attempt,
after having already been an InactiveVault for 39,791 blocks (about 2.76 days).
Collapsible section initial thoughts, including pipeline thoughts on TSS burden without concurrency or unsignable-item-blocking-all-other-InactiveVault-outbounds if with concurrency
(Particularly unfortunate if the Migrates which finished the vault retirement had also contributed to TSS burden which made rescheduling necessary, though depending on timing this might never be the case.)
I hypothesise that the intent was to avoid nodes being overwhelmed by TSS for rescheduled outbounds from many different InactiveVaults
(for instance if someone were to send just-over-dust to every InactiveVault of every chain for the past month,
maybe every node trying to do TSS signing of 80 outbounds at once);
v1.127.0's !3372 (merged) 'Pipeline Signing Routines' should hopefully be preventing that situation,
but I'd like to ask whether you think a single unsignable InactiveVault outbound
(for instance a refund from the oldest InactiveVault and not having 2/3rds members online)
might block all other InactiveVault outbounds until it were dropped from MaxOutboundAttempts.
SpawnSignings I think(/hope?) would always go through the InactiveVault items in the same order;
that way all available nodes for the 'earliest ranked' item would all try to sign it,
instead of (for instance) half trying to sign that and half trying to sign another that they were a part of (but the first group weren't).
|
There might be a third group which tried and failed to sign something that they and the second group were members of and the first group weren't,
but that signing couldn't take place until signing attempts for the earliest-ranked item in the list ended.
|
(Which returns to the matter of when the earliest-ranked item is an old unsignable one,
and none of the RetiringVault->InactiveVault outbounds can be signed by nodes which were members for it until it's dropped.)
This is plausibly a bad(/worse-than-ideal) suggestion, but LackSigning could choose to only drop-without-rescheduling the TxOutItem if it were a Refund? \
Refunds from RetiringVaults would have been inbounds to those same vaults,
but arguably by the point that something needing a refund has been sent to a RetiringVault rather than an ActiveVault something maybe have gone wrong on the user's side anyway. Hmm.
[Edit: Collapsible section now made; a GetVault would be meaningless,
as if it were a RetiringVault (I think) it would stay a RetiringVault until the rescheduled outbound succeeded or were dropped.]
Based on the above, another(/a complementary) plausibly-bad suggestion:
In refundTx, do a GetVault and don't do the https://gitlab.com/thorchain/thornode/-/blob/v1.127.0/x/thorchain/helpers.go#L71 VaultPubKey: tx.ObservedPubKey,
if the vault isn't an InactiveVault.
(The ideal being that an about-to-retire RetiringVault inbound is--wait.
[Edit: Collapsible section now made.]
It looks like all the ETH.SNX was migrated, ignoring that some of its funds were currently spoken for.
What should have happened (I believe) is that the migrate try to migrate only the unspoken-for ETH.SNX until the outbound was either fulfilled or reassigned to an ActiveVault (not possible in this case),
but it seems the migrate process ignored pending outbounds.
Notably, here for LenPendingTxBlockHeights only refers to pending migrates, not other pending outbounds.
The !vault.HasFunds()check should be done before any outbound deduction.
Unlike the txout manager (which only checks the TxOutItem asset), preferable to do all the deductions from the retiring vaults once,
for every asset.
perhaps using a map of vault PubKey to RetiringVault
Though tempting to postpone all migrations until there are no pending outbounds (simpler),
the best case scenario for that is that migrations don't start until outbounds have all been reassigned to ActiveVaults that have had no migrations yet
(behaviour I'd fairly-strongly prefer to avoid).