Maps Are Evil Incident Review
Summary
On Friday Nov 12, THORChain reached a consensus failure because nodes iterating over a map (whose order is not deterministic in Go) errored at different indexes. This caused the chain to halt. After evaluating a few initial approaches, the full resync method was chosen. Consensus was restored on Nov 18.
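For context, Go does not define an iteration order for maps, so the index at which a failing loop errors out can differ from node to node. A minimal sketch of the failure mode (the pool names and handler below are purely illustrative, not the actual thornode code):

```go
package main

import (
	"errors"
	"fmt"
)

// process stands in for a real per-pool handler; assume the ETH entry fails.
func process(asset string) error {
	if asset == "ETH.ETH" {
		return errors.New("bad pool state")
	}
	return nil
}

func main() {
	pools := map[string]int64{"BTC.BTC": 1, "ETH.ETH": 2, "BNB.BNB": 3}

	// Go randomizes map iteration order, so the index at which the loop
	// errors (and how many entries were already processed before it) can
	// differ between runs -- and between validators, which breaks consensus.
	for run := 0; run < 3; run++ {
		idx := 0
		for asset := range pools {
			if err := process(asset); err != nil {
				fmt.Printf("run %d: errored at index %d (%s)\n", run, idx, asset)
				break
			}
			idx++
		}
	}
}
```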
After the network was restored, there was a secondary issue when trading was resumed before all the nodes' bifrosts had reached the tip of each chain. This caused some nodes to receive slash points for not observing transactions. Trading was halted on those chains until the bifrosts caught up.
The follow-up list is at the bottom; please feel free to suggest other follow-ups in the comments or file issues directly.
Timeline
Friday Nov 12 (All times GMT)
14:35: Last consensus block: 2943996 https://thornode.thorchain.info/blocks/2943996
14:47: Initial Report: https://discord.com/channels/838986635756044328/839002619481554955/908729819552956467
16:41: Root cause update: https://discord.com/channels/838986635756044328/839002619481554955/908758512736276490
16:50: Initial PR: !1995 (merged)
The remainder of that Friday was spent trying to get Tendermint's new rollback command to work.
By late Friday, it was clear this wasn't going to suit our needs, and the team started working on the resync strategy.
Sat Nov 13 - Mon Nov 15
This time was all spent testing/verifying the resync and preparing/testing the final instructions.
Tues Nov 16
02:51: Resync instructions (#1165 (closed)) posted to discord: https://discord.com/channels/838986635756044328/839001804812451873/909999025195974717
Thurs Nov 18
02:36: Speed up patch posted to discord: https://discord.com/channels/838986635756044328/839001804812451873/910719992222335057
05:05: Consensus restored: https://thornode.thorchain.info/blocks/2943998
14:08: Bifrost start instructions sent: https://discord.com/channels/838986635756044328/839001804812451873/910894225112059916
18:46: Trading enabled: https://discord.com/channels/838986635756044328/839001804812451873/910964212635947049
19:52: BTC/ETH halted due to slashes: https://discord.com/channels/838986635756044328/839001804812451873/910980771739734066
Friday Nov 19
13:55: BTC/ETH enabled: https://discord.com/channels/838986635756044328/839001804812451873/911253451978190858
17:18: LP Enabled: https://discord.com/channels/838986635756044328/839001804812451873/911304381838819328
Detection
Thanks to Cam on Discord for noticing this relatively quickly, but 12 minutes is still far too long for THORChain. One key follow-up will be better alerting for node operators and community members on consensus failures or long times between blocks.
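As a rough sketch of what that alerting could look like, the snippet below polls a node's Tendermint RPC /status endpoint and flags stale blocks; the RPC address, threshold, and alert mechanism are placeholders, not an agreed design:

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

// statusResp holds the subset of Tendermint's /status response we care about.
type statusResp struct {
	Result struct {
		SyncInfo struct {
			LatestBlockHeight string    `json:"latest_block_height"`
			LatestBlockTime   time.Time `json:"latest_block_time"`
		} `json:"sync_info"`
	} `json:"result"`
}

func main() {
	const rpc = "http://localhost:27147/status" // placeholder node RPC address
	const maxBlockAge = 2 * time.Minute         // placeholder alert threshold

	for {
		if resp, err := http.Get(rpc); err == nil {
			var s statusResp
			if err := json.NewDecoder(resp.Body).Decode(&s); err == nil {
				age := time.Since(s.Result.SyncInfo.LatestBlockTime)
				if age > maxBlockAge {
					// Swap this for a real pager / Discord webhook in practice.
					fmt.Printf("ALERT: height %s, last block %s ago\n",
						s.Result.SyncInfo.LatestBlockHeight, age.Round(time.Second))
				}
			}
			resp.Body.Close()
		}
		time.Sleep(30 * time.Second)
	}
}
```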
Escalation
Once this was raised, a response team was assembled relatively quickly, but we were lucky in terms of the timing (~9:30 AM Eastern Time in the US). Had this occurred at a worse time, it could have been much longer until people were aware. Given the distributed nature of contributors, there likely would have been someone aware, but it is difficult to wake other contributors up when necessary while also preserving individuals' privacy.
Remediation
Rollback investigation
The initial investigation into the rollback command was promising. It seemed to do exactly what we wanted, but given how new the feature was and our lack of detailed understanding of exactly how it functioned, it ended up not working. This will definitely be investigated going forward, as it is believed it should be a viable option given proper application-level support.
Soft Fork
During the entire time, various contributors were investigating the soft fork approach. Due to a lack of recent testing and uncertainty about how safe it was, it was not attempted. Going forward this will definitely be explored, as having the option could be very helpful in the future (in an emergency, or simply to compact chain state).
Resync
As it became clearer that the rollback wouldn't work, the resync method began to be investigated and tested. Part of the difficulty here was in setting up the test properly: a new mocknet had to be created, mock chains spun up, nodes churned in, a hacked code version pushed to cause a consensus failure, the fixed version re-pushed, and then the fix verified.
The tooling and support for this isn't as good as it should be, and not many people are able to do the above in a reasonable amount of time. This limited our ability to rapidly test this and other fixes.
During testing there was some delay in the nodes reaching consensus, but it didn't seem significant. Only later would we realize that this was due to the difference in the number of pre-vote rounds each node had attempted.
Once the idea was proven, there was still significant testing to verify the fix itself, and then testing the packaging of the fix via the node launcher scripts. This was tested on numerous live validators to ensure the upgrade flow would be smooth. Again this took a number of hours.
Only after the fix went live on 2/3rds of the nodes did the pre-vote problem start to become apparent. We tested editing the config file to change the delta times, but when restarting a node with these changes the node wouldn't start. Unfortunately this was a red herring, since we assumed the restart failure was somehow caused by the config changes.
Once it was realized that any node that had resynced would stall on restart, focus shifted there, since fixing that should then unlock the ability to edit the config file. This is where we got the dedicated help of a member of the Tendermint team to troubleshoot, who realized we needed to try removing the cs.wal directory. After testing this, we were able to edit the config file and do a restart.
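For background on why the delta times mattered: our understanding is that Tendermint's per-step consensus timeouts grow linearly with the round number, so nodes that had accumulated many failed rounds during the halt waited a very long time each round, and shrinking the deltas shortens that wait. A sketch of the formula as we understand it (the base/delta values are stock Tendermint defaults, shown for illustration only):

```go
package main

import (
	"fmt"
	"time"
)

// stepTimeout models our understanding of Tendermint's per-round timeout
// scaling: base timeout plus delta multiplied by the round number.
func stepTimeout(base, delta time.Duration, round int64) time.Duration {
	return base + time.Duration(round)*delta
}

func main() {
	// Stock Tendermint defaults for the prevote step, for illustration only;
	// the values a given node actually runs with live in its config.toml.
	timeoutPrevote := 1 * time.Second
	timeoutPrevoteDelta := 500 * time.Millisecond

	// After days of failed rounds, the accumulated round number makes every
	// further attempt wait longer, which is why reducing the deltas in the
	// config sped recovery up.
	for _, round := range []int64{0, 10, 100, 1000} {
		fmt.Printf("round %4d: prevote timeout %v\n",
			round, stepTimeout(timeoutPrevote, timeoutPrevoteDelta, round))
	}
}
```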
There was another round of testing to verify the cs.wal removal and config edits would work across various validators, and then some work to figure out the cleanest/easiest way to deploy them. This was when we decided to give node operators the raw commands instead of spending further time validating the cleaner Kubernetes fix. Fortunately it was quickly adopted.
Prevention
The code has already been reviewed for usages where a map is iterated over, but another review would be a good idea.
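For reference, the usual fix pattern where iteration order can affect consensus-relevant state is to sort the keys before iterating; a minimal sketch (names are illustrative, not actual thornode code):

```go
package main

import (
	"fmt"
	"sort"
)

func main() {
	pools := map[string]int64{"BTC.BTC": 1, "ETH.ETH": 2, "BNB.BNB": 3}

	// Collect and sort the keys first so every node walks the map in the
	// same order, keeping state transitions (and any errors) deterministic.
	keys := make([]string, 0, len(pools))
	for k := range pools {
		keys = append(keys, k)
	}
	sort.Strings(keys)

	for _, k := range keys {
		fmt.Println(k, pools[k])
	}
}
```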
Eliminating this class of errors is almost impossible, but there is a lot to do to make recovery faster going forward. Now that we have a process in place for this, we should try to ensure nodes have nightly snapshots by default.
There should also be better review and less urgency when issuing non-time-critical mimir commands. While this is mostly a human process for now, node-mimir will eventually make this a reality. A better community dashboard for chain health could also have helped here.
List of follow-up items/tasks; we will try to add items here from the comments below. Please feel free to suggest anything or create an issue.
- [x] Cleanup nginx endpoint for trading halt override
- [x] Enable bifrost and catchup
- [x] Enable Trading
- [x] Enable LP
- [ ] Alerting for chain halt or slow block times: thorchain/devops/node-launcher#56
- [ ] Soft fork: #1166 (closed)
- [ ] Restore slash logic: #1167 (closed)
- [ ] Investigate automatic snapshots for nodes: thorchain/devops/node-launcher#55
- [ ] Easily halt trading when network is down, requires manual tuning right now via nginx
- [ ] Easy mocknet setup with custom build
- [ ] Community dashboard including bifrost status (for mimir change inputs)
- [ ] Create node-mimir for start commands? (In addition to regular mimir for now, transition period)
- [ ] Investigate THORChain sync speedup