
Replace the Host Storage Manager

David Vorick requested to merge host2 into master

After running some benchmarks, it became clear that speeds on the old storage manager were not acceptable. They were 1-2 orders of magnitude below acceptable. In particular, operations such as renewing a storage obligation were taking so long that it became unlikely such a process would be possible in the wild with any large storage obligation.

We don't have any sort of stress testing right now, and even just a little bit would have revealed this problem.

I spent a few days learning everything there is to know about disks and I/O and memory and latency and bottlenecks, then took it upon myself to replace boltdb in the storage manager with a custom, hand-crafted database. This database makes a single disk operation, and does batching that matches acceptable latencies within the host.

The new contract manager is incomplete, as it doesn't actually store any data yet. Most of the optimizations haven't been implemented, because it takes a lot of work to create a custom ACID database. But the guts of the database are complete, and extending it to serve new functions should be pretty easy.

I have implemented a single operation - AddStorageFolder - and then written a boatload of tests around AddStorageFolder to more or less demonstrate that the ACID transactions are functional. This testing includes some more mocking, including a new type of mocking I'll label 'disrupt mocking'. There's a disrupt() dependency that in production always returns false. But the code has paths such that when it returns true, logic is skipped, a function is terminated early, or the code-flow of the program is otherwise disrupted.

That made it easy to do things like kill AddStorageFolder halfway through.
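For reviewers unfamiliar with the pattern, here is a minimal sketch of what disrupt mocking can look like. It is not the code in this diff: the `dependencies` interface, the string argument to `disrupt`, the `"afterWAL"` label, and the in-memory `wal` slice are all hypothetical names chosen for illustration. Only the idea (a `disrupt()` dependency that returns false in production and true in tests to cut a function short) comes from the description above.

```go
package main

import (
	"errors"
	"fmt"
)

// dependencies is a hypothetical interface holding the disrupt hook.
// The string label identifying the disruption point is an assumption.
type dependencies interface {
	disrupt(name string) bool
}

// productionDependencies never disrupts, so production code-flow is unchanged.
type productionDependencies struct{}

func (productionDependencies) disrupt(string) bool { return false }

// disruptAt is a test dependency that disrupts at exactly one labeled point.
type disruptAt struct{ target string }

func (d disruptAt) disrupt(name string) bool { return name == d.target }

var errDisrupted = errors.New("disrupted")

// contractManager is a toy stand-in; the wal slice imitates a write-ahead log.
type contractManager struct {
	deps           dependencies
	wal            []string
	storageFolders []string
}

// AddStorageFolder records its intent, then applies the change. The disrupt
// check sits between the two steps so a test can stop the function halfway.
func (cm *contractManager) AddStorageFolder(path string) error {
	cm.wal = append(cm.wal, "add "+path)
	if cm.deps.disrupt("afterWAL") {
		// Simulate a crash: intent is logged but the change is not applied.
		return errDisrupted
	}
	cm.storageFolders = append(cm.storageFolders, path)
	cm.wal = cm.wal[:0] // change applied, so the logged intent can be cleared
	return nil
}

func main() {
	prod := &contractManager{deps: productionDependencies{}}
	fmt.Println(prod.AddStorageFolder("/tmp/a"), prod.storageFolders, prod.wal)

	broken := &contractManager{deps: disruptAt{target: "afterWAL"}}
	fmt.Println(broken.AddStorageFolder("/tmp/a"), broken.storageFolders, broken.wal)
}
```

A test can then construct the manager with `disruptAt`, call AddStorageFolder, and check that recovery from the leftover log entry restores a consistent state.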

A big part of the diff is commented-out code that I copy-pasted from the storagemanager to help guide the new functions. The old storagemanager had a bad database but extensive fault-tolerance, and while doing the replacement I wanted to make sure that I wasn't dropping any of that tolerance. I'll continue to remove the commented-out code as I re-implement it.

All told, I've spent probably 80 hours on this so far. Hopefully it'll speed things up enough to make the investment pay off. I'm pretty sure there will be massive performance gains.
