Skip to content

Pruning nodes: Take cs_main for `getblock` and `getblockstats`

Summary

Closes #212 (closed). Below is an excerpt from #212 (closed), reworded slightly:

After reviewing the code again for !849 (merged) -- it turns out a potential regression was introduced in that MR. If you recall, that MR decided to not keep cs_main held while it was reading block and/or undo data from the disk. This is normally not a problem on non-pruning nodes since these files never go away once they exist.

However, on pruning nodes -- they may get deleted in parallel while the getblock or getblockstats RPCs run.

Now, again, this wouldn't even be a problem either on a unix-like system which supports unlinking files while they are open (the file only gets physically removed from the internal filesystem data structures when the last file descriptor to it is closed).

However, on Windows the above is not the case. What's worse, is that on a Windows pruning node, the FlushStateToDisk internal function in validation.cpp which does the pruning of block files may end up getting an exception thrown if it fails to delete the block/undo file in question (if just the right unlucky timing happens whereby a geblock or getblockstats call happens to be touching a file it wants to delete at preciely the right moment).

The problem then would be that on a Windows pruning node that somehow is also serving up RPC getblock and getblockstats requests (not a likely scenario) -- the exception thrown in FlushStateToDisk could cause the node to shut down unexpectedly.

I was unable to reproduce this regression here -- but reading the code I am convinced it can happen, hence the need for this MR.

Test Plan

This MR introduces no major behavioral change. In the normal execution path basically it's the same code. In the pruning execution path, it does now take a lock when reading undo or block data from disk for the duration of the read, which in some sense is what it did anyway for all execution paths before !849 (merged) was merged.

I was unable to reproduce the regression in #212 (closed).

But the test plan would be:

  • ninja check check-functional

Also if you happen to have a Windows pruning node and happen to want to hammer it with RPC against master versus against this MR to see if you can get it to crash on master vs not crash with this MR, be my guest. I tried to do this for about 10 minutes here without success. You need to get really lucky to trigger this issue. (And you need to be querying a block that is "about to be pruned", which takes some art to determine.)

Merge request reports