Pruning nodes: Take cs_main for `getblock` and `getblockstats`
Summary
Closes #212 (closed). Below is an excerpt from #212 (closed), reworded slightly:
After reviewing the code again for !849 (merged) -- it turns out a potential
regression was introduced in that MR. If you recall, that MR decided to
not keep cs_main
held while it was reading block and/or undo data from
the disk. This is normally not a problem on non-pruning nodes since
these files never go away once they exist.
However, on pruning nodes -- they may get deleted in parallel while
the getblock
or getblockstats
RPCs run.
Now, again, this wouldn't even be a problem either on a unix-like system which supports unlinking files while they are open (the file only gets physically removed from the internal filesystem data structures when the last file descriptor to it is closed).
However, on Windows the above is not the case. What's worse, is that
on a Windows pruning node, the FlushStateToDisk
internal function in
validation.cpp
which does the pruning of block files may end up
getting an exception thrown if it fails to delete the block/undo file in
question (if just the right unlucky timing happens whereby a geblock
or
getblockstats
call happens to be touching a file it wants to delete
at preciely the right moment).
The problem then would be that on a Windows pruning node that somehow is
also serving up RPC getblock
and getblockstats
requests (not a
likely scenario) -- the exception thrown in FlushStateToDisk
could
cause the node to shut down unexpectedly.
I was unable to reproduce this regression here -- but reading the code I am convinced it can happen, hence the need for this MR.
Test Plan
This MR introduces no major behavioral change. In the normal execution path basically it's the same code. In the pruning execution path, it does now take a lock when reading undo or block data from disk for the duration of the read, which in some sense is what it did anyway for all execution paths before !849 (merged) was merged.
I was unable to reproduce the regression in #212 (closed).
But the test plan would be:
ninja check check-functional
Also if you happen to have a Windows pruning node and happen to want to hammer it with RPC against master versus against this MR to see if you can get it to crash on master vs not crash with this MR, be my guest. I tried to do this for about 10 minutes here without success. You need to get really lucky to trigger this issue. (And you need to be querying a block that is "about to be pruned", which takes some art to determine.)