Directory Heap Changes
PROGRAMMING TASK
Description of Task
The goal is to update the directoryHeap so that it can also contain file and chunk information, in order to reduce the amount of duplicated data reading during the building of the uploadHeap.
Reason or Need for Change
Right now, every time we add chunks to the uploadHeap we pop a directory from the directoryHeap and then read all the files and all the chunks from that directory in order to find the chunks that need to be repaired. As we further optimize the upload and repair code to ensure we are always working on the chunks most in need of repair, we will encounter situations where we only want to add a few chunks from a directory and then move on to other directories. When this happens, we will read the same data from the directory multiple times as we work through adding all of its chunks. To avoid this, we need some way to cache the file and chunk information of the directory.
Design / Proposal
Currently the directoryHeap contains directories that have the following structure:
directory struct {
    aggregateHealth float64 // Health used to sort directory when unexplored
    health          float64 // Health used to sort directory when explored
    explored        bool    // Indicates whether or not we have explored the directory and added the subdirectories
    siaPath         SiaPath
}
I propose we expand this element to include the following fields:
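directory struct {
    // Existing fields omitted
    fileHeap  fileHeap   // Files in the directory that need repair, sorted by health
    chunkHeap uploadHeap // Chunks built from those files, cached until added to the uploadHeap
}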
To start, the intent of these fields is to cache the information we read and build from the directory when we try to add chunks to the uploadHeap.
The Repair Loop starts by popping an explored directory off the directoryHeap. If this directory needs repair, we then read all the files from that directory. When we read these files we should store them in a fileHeap, which would be a max heap sorted by the file's health. At this point we can ignore and drop any healthy files. By adding the files that need repair to a fileHeap in the directory element, we avoid having to read all the files from disk again if we have to add the directory back to the directoryHeap because the uploadHeap is full.
Now that we have a fileHeap, when we go to build the unfinishedUploadChunks we can start by popping the worst health file off the fileHeap, so we know we are starting with the worst chunk(s). As we build chunks from the files in the directory we can add them to the chunkHeap in the directory as a cache. This ensures we are adding the worst health chunks to the uploadHeap first.
Since we are caching the chunks in the directory, we can also check the directory's fileHeap and the directoryHeap to ensure we are still adding chunks that have a worse health than any other chunk in the file system. If we find that there is a worse health file in the directory's fileHeap or a worse health directory in the directoryHeap, we can stop where we are and either grab the worse health file from the directory's fileHeap or add the directory back to the directoryHeap with an updated health. When we eventually come back to this directory, we can see that there are files in the fileHeap and chunks in the chunkHeap, so we avoid re-reading all the files from the directory and go directly to adding chunks to the uploadHeap.