Directory Heap Changes

PROGRAMMING TASK

Description of Task

The goal is to update the directoryHeap so that it can also hold file and chunk information, reducing how often the same data is read from disk while building the uploadHeap.

Reason or Need for Change

Right now, every time we add chunks to the uploadHeap, we pop a directory from the directoryHeap and then read all the files and all the chunks from that directory to find the chunks that need repair. As we further optimize the upload and repair code to ensure we are always working on the chunks most in need of repair, we will encounter situations where we only want to add a few chunks from a directory and then move on to other directories. When this happens, we will read the same data from the directory multiple times as we work through adding all of its chunks. To avoid this, we need some way to cache the directory's file and chunk information.

Design / Proposal

Currently the directoryHeap contains directories that have the following structure:

directory struct {
  aggregateHealth float64  // Health used to sort directory when unexplored
  health          float64  // Health used to sort directory when explored
  explored        bool     // Indicates whether or not we have explored the directory and added the subdirectories
  siaPath         SiaPath
}

I propose we expand this element to include the following fields:

directory struct {
  fileHeap   fileHeap
  chunkHeap  uploadHeap
}

To start, the intent of these fields is to cache the information we read and build from the directory when we try to add chunks to the uploadHeap.

The Repair Loop starts by popping an explored directory off the directoryHeap. If this directory needs repair, we then read all the files from that directory. When we read these files, we should store them in a fileHeap, which would be a max heap sorted by the file's health. At this point we can ignore and drop any healthy files. By adding the files that need repair to a fileHeap in the directory element, we avoid having to read all the files from disk again if we have to add the directory back to the directoryHeap because the uploadHeap is full.

Now that we have a fileHeap, when we go to build the unfinishedUploadChunks we can start by popping the worst health file off the fileHeap, so we know we are starting with the worst chunk(s). As we build chunks from the files in the directory, we can add them to the chunkHeap in the directory as a cache. This ensures we are adding the worst health chunks to the uploadHeap first. Since we are caching the chunks in the directory, we can also check the directory's fileHeap and the directoryHeap to ensure we are still adding chunks that have a worse health than any other chunk in the file system. If we find a worse health file in the directory's fileHeap or a worse health directory in the directoryHeap, we can stop where we are and either grab the worse health file from the directory's fileHeap or add the directory back to the directoryHeap with an updated health. When we eventually come back to this directory, we will see that there are files in the fileHeap and chunks in the chunkHeap, so we can go directly to adding chunks to the uploadHeap and avoid re-reading all the files from the directory.
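The three-way choice described above (keep adding chunks, switch to a worse file, or put the directory back) can be sketched as a small decision function. Everything here is hypothetical and only meant to illustrate the comparison: the action type, nextAction, and remainingHealth are not existing renter code, health is again assumed to be worse when larger, and the caller is assumed to pass a negative value for an empty heap and to handle the case where both caches are empty.

```go
package main

import (
	"fmt"
	"math"
)

// action names the three outcomes described in the proposal.
type action int

const (
	addChunk   action = iota // keep adding cached chunks to the uploadHeap
	popFile                  // a cached file is worse: build its chunks next
	requeueDir               // another directory is worse: push this one back
)

// nextAction compares the worst cached chunk in this directory against the
// worst remaining file in its fileHeap and the worst directory still in the
// directoryHeap. Assumption: a larger health value means worse health.
func nextAction(worstChunk, worstFile, worstOtherDir float64) action {
	switch {
	case worstOtherDir > worstChunk && worstOtherDir > worstFile:
		return requeueDir
	case worstFile > worstChunk:
		return popFile
	default:
		return addChunk
	}
}

// remainingHealth is the updated health a directory would carry when it is
// added back to the directoryHeap: the worst health still cached in it.
func remainingHealth(worstChunk, worstFile float64) float64 {
	return math.Max(worstChunk, worstFile)
}

func main() {
	// Worst cached chunk beats everything else: keep adding chunks.
	fmt.Println(nextAction(0.9, 0.5, 0.7) == addChunk)
	// A file still in the fileHeap is worse than the cached chunks.
	fmt.Println(nextAction(0.4, 0.6, 0.5) == popFile)
	// Another directory is worse than anything cached here.
	fmt.Println(nextAction(0.4, 0.6, 0.7) == requeueDir)
	fmt.Println(remainingHealth(0.4, 0.6))
}
```

Ties are resolved in favor of the current directory in this sketch, which avoids requeueing when nothing elsewhere is strictly worse; the proposal itself does not specify tie-breaking.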

Edited May 17, 2019 by Matthew Sevey
