Repair Updates Round 2
Repair and Upload Code
Current Workflow and Components
The repair and upload code has two main data structures, the `uploadHeap` and the `directoryHeap`. The `uploadHeap` is a max heap (since a higher health is worse) of `unfinishedUploadChunks`, and the `directoryHeap` is a max heap of directories.
A high-level summary of the order of operations for the repair and upload code is as follows:
// Repair Needed or File Uploaded
//   Add chunks to uploadHeap
// Pop directory off directoryHeap
//   Add chunks from directory to uploadHeap
//     Read all files in directory (ignore healthy files)
//     Build all chunks from all files, drop healthy chunks
// Repair chunks
//   Refill heap once uploadHeap is empty or drops below 20 chunks
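To make the sketches in the rest of this document concrete, here is a minimal Go model of the `uploadHeap`. All names here (`uploadChunk`, `chunkHeap`, `minUploadHeapSize`) are illustrative stand-ins for the real renter types, not the actual siad code:

```go
package main

import "container/heap"

// uploadChunk is a stand-in for unfinishedUploadChunk; only the health
// field matters for heap ordering.
type uploadChunk struct {
	health float64
}

// chunkHeap models the uploadHeap: a max heap on health, since a higher
// health value means a chunk is in worse shape.
type chunkHeap []*uploadChunk

func (h chunkHeap) Len() int           { return len(h) }
func (h chunkHeap) Less(i, j int) bool { return h[i].health > h[j].health }
func (h chunkHeap) Swap(i, j int)      { h[i], h[j] = h[j], h[i] }

func (h *chunkHeap) Push(x interface{}) { *h = append(*h, x.(*uploadChunk)) }
func (h *chunkHeap) Pop() interface{} {
	old := *h
	c := old[len(old)-1]
	*h = old[:len(old)-1]
	return c
}

// minUploadHeapSize mirrors the "refill below 20 chunks" rule above.
const minUploadHeapSize = 20

func main() {
	uh := &chunkHeap{}
	heap.Push(uh, &uploadChunk{health: 0.2})
	heap.Push(uh, &uploadChunk{health: 1.5})
	worst := heap.Pop(uh).(*uploadChunk) // the 1.5 health chunk pops first
	_ = worst
}
```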
Problems to Solve
The following are the current problem areas that can be improved:
- Memory usage
- Data read in order to build the `uploadHeap`
- Always repair worst chunks
- Always add worst chunks
Memory Usage
Currently the `uploadHeap` is unbounded, so for large renters and large batch uploads the memory used by the `uploadHeap` can grow excessive. The `memoryManager` restricts memory usage for the actual uploads, but all the chunks sit in memory waiting for their turn to be uploaded. The main opportunity here is to limit the size of the `uploadHeap`. Testing shows that each `uploadChunk` takes up ~2114 bytes of memory; this does not include the file or the workers, which are pointers to shared resources. Starting with a size limit of 1,000 chunks would restrict the upload chunks to ~2.1MB of memory while ensuring there is sufficient back pressure for the uploads and repairs. An important feature that will need to be added with this change is the ability to add a directory back to the `directoryHeap` with an updated health if it still needs repair (a sketch follows the list below).
- Add a 1,000 chunk size limit on the `uploadHeap`
- Add the ability to add a directory back to the `directoryHeap` if it is still in need of repair. Consider how uploads are handled. This includes being able to update a directory element in the heap.
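A sketch of how the size limit and the directory re-add could interact, reusing the `chunkHeap` stand-in from the sketch above; `maxUploadHeapChunks` and the return-value convention are assumptions:

```go
// maxUploadHeapChunks caps the heap at 1000 chunks * ~2114 bytes ≈ 2.1MB.
const maxUploadHeapChunks = 1000

// tryPush adds a chunk unless the uploadHeap is full. A false return tells
// the caller to push the owning directory back onto the directoryHeap with
// an updated health so the remaining chunks are picked up later.
func tryPush(uh *chunkHeap, c *uploadChunk) bool {
	if uh.Len() >= maxUploadHeapChunks {
		return false
	}
	heap.Push(uh, c)
	return true
}
```

On the false path is where the directory, carrying the health of the next chunk that would have been added, goes back onto the `directoryHeap`.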
Data Read
With the change to reading files from disk, it is important to make sure we are optimizing the amount of data read and processed in order to build the chunk heap. The ideal is to read N chunks from disk to add N chunks to the `uploadHeap`. Some steps we can take to move closer to that ideal are reducing the number of healthy files we build chunks for and reducing the number of healthy chunks we build. Also, if we start with the worst health files first, we ensure we are starting with the worst chunks as well. One way to do this is to have a `fileHeap` that we build from the `directoryHeap`. This way we can pop the worst health file first and build chunks from it. Once we hit a healthy file we can drop the rest of the heap.
EDIT: Instead of a file heap, the `directoryHeap` can be updated to include `directoryElements`. This heap would start with `unexplored` directories, but instead of pushing `explored` directories back onto the heap, we can push `directoryElements`. These `directoryElements` would be sets of chunks pulled out of the directory. When more chunks are needed for the heap, these `directoryElements` are popped off to be added to the `uploadHeap` (a sketch follows below). Created #3558 (closed) to go into more detail on this specific change.
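A sketch of what a `directoryElement` and the updated heap might look like under this change; all field names here are assumptions, and #3558 has the authoritative details:

```go
// directoryElement is either an unexplored directory (chunks not yet
// built) or a set of chunks already pulled out of an explored directory.
type directoryElement struct {
	siaPath  string
	health   float64        // worst health represented by this element
	explored bool           // false: directory still needs a disk read
	chunks   []*uploadChunk // populated once explored
}

// directoryHeap is a max heap of directoryElements on health.
type directoryHeap []*directoryElement

func (h directoryHeap) Len() int           { return len(h) }
func (h directoryHeap) Less(i, j int) bool { return h[i].health > h[j].health }
func (h directoryHeap) Swap(i, j int)      { h[i], h[j] = h[j], h[i] }

func (h *directoryHeap) Push(x interface{}) { *h = append(*h, x.(*directoryElement)) }
func (h *directoryHeap) Pop() interface{} {
	old := *h
	el := old[len(old)-1]
	*h = old[:len(old)-1]
	return el
}
```

Popping an `unexplored` element would trigger the disk read, after which its chunks come back as explored `directoryElements`; popping an explored element hands its chunks straight to the `uploadHeap`.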
A further reduction in data read is to use the cached health values for the `siafile`. The upload and repair code works off the health information calculated by the `HealthLoop` and `Bubble`, which update the cached health values of the `siafiles`. This means the cached values should be fairly accurate. New uploads will need to ensure that the cached values are set, so that new uploads are not inadvertently treated as full health, i.e. 0.
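A small sketch of the cached-value concern for new uploads; `siafileStub`, `cachedHealth`, and `worstHealth` are assumed names, not the real siafile metadata:

```go
// siafileStub stands in for the siafile metadata relevant here.
type siafileStub struct {
	cachedHealth float64
}

// newUpload seeds the cache with a needs-repair health. Leaving the zero
// value in place would make a brand new upload look fully healthy (0).
func newUpload(worstHealth float64) *siafileStub {
	return &siafileStub{cachedHealth: worstHealth}
}
```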
ADD: To reduce memory bloat, if we break out of the repair loop because we see that the `directoryHeap` has more in-need chunks, we should zero out the `uploadHeap` and start from scratch. One open question is whether we add the dropped chunks back to the `directoryHeap`, or just pick them up on the next repair cycle when we rebuild the `directoryHeap`.
- ~~Build `fileHeap`~~ EDIT: Update `directoryHeap` to have `directoryElements`
- Use cached values of `siafile` health in repair code (remove offline and goodForRenew maps)
Always Repair Worst Chunks
To ensure we are always repairing the worst chunks, the `uploadHeap` is a max heap, since a higher health is a worse health. This covers us for pulling chunks off the `uploadHeap`, but we also want to ensure that we are still repairing the worst chunks after we have been repairing for a while. For this we should check that the chunk we are repairing still has worse health than the rest of the chunks in the `directoryHeap`. If there are worse health chunks in the directory heap, we should stop repairing and go grab those chunks. One thing to look at is the edge case of getting into a cycle of only repairing a small number of chunks and being stuck with hundreds of chunks in the `uploadHeap` that are healthier than the `directoryHeap`. A sketch of this check follows the list below.
- Compare worst health in `uploadHeap` against worst health in `directoryHeap`.
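A sketch of the comparison, reusing the stand-in heaps from the sketches above. It compares the tops of the two max heaps; on a true result the repair loop would zero out the `uploadHeap` (per the ADD note above) and refill from the `directoryHeap`:

```go
// directoryHeapIsWorse reports whether the worst element left on the
// directoryHeap is in worse shape than the worst chunk queued for repair.
// In both max heaps, index 0 holds the worst (highest) health.
func directoryHeapIsWorse(uh chunkHeap, dh directoryHeap) bool {
	if len(dh) == 0 {
		return false
	}
	if len(uh) == 0 {
		return true
	}
	return dh[0].health > uh[0].health
}
```

Run between chunk repairs, this is where the low-throughput/high-cycling edge case needs a guard, e.g. requiring some minimum batch of repairs before rechecking.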
Always Add Worst Chunks
To ensure we are always adding the worst chunks to the uploadHeap
we add chunks from directories popped off the directoryHeap
which is also a max heap. To further ensure we are then adding the worst chunks to the uploadHeap
from this worst health directory, we can build a temporary heap of chunks to ensure we have the 1000 worst chunks in the directory. As mention above with repairing chunks, we should compare the chunks being added to the temporary heap to the worst health of the directoryHeap
to ensure we are still adding the worst health chunks in the file system. A last potential consideration is to further compare the chunks being added against the chunks currently in the uploadHeap
. This would help ensure that we are not adding a bunch of chunks that are healthier than the chunks in the uploadHeap
. However this might cause some odd edge cases where we are not adding chunks, even though they are the most in need chunks in the file system but are healthier than the chunks already in the uploadHeap
. If we are adding chunks to the uploadHeap
, something trigger this action, either because more chunks are needed, files where seen as needing repair, or files were newly uploaded. From the directoryHeap
and the temporary chunk heap, we can be confident that we are trying to add the worst chunks in the file system so it makes sense that they should be added to the uploadHeap
if there is space available in the uploadHeap
.
- ~~Build temp chunk heap to add chunks to `uploadHeap`~~ EDIT: use `directoryElements` from `directoryHeap`
- Compare chunks being added to the `uploadHeap` against the health of the `directoryHeap`
- TBD: compare chunks against chunks in `uploadHeap`
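A sketch of the comparison while adding, again using the stand-ins from the sketches above; `temp` is the temporary per-directory chunk heap and `dirWorst` is the worst health remaining on the `directoryHeap`:

```go
// addFromTempHeap drains a temporary per-directory chunk heap into the
// uploadHeap, stopping if the directoryHeap now holds worse chunks or
// the uploadHeap is full.
func addFromTempHeap(uh *chunkHeap, temp *chunkHeap, dirWorst float64) {
	for temp.Len() > 0 && uh.Len() < maxUploadHeapChunks {
		c := heap.Pop(temp).(*uploadChunk)
		if c.health < dirWorst {
			// A directory on the directoryHeap is now worse than this
			// chunk; stop and let the repair loop pop that directory.
			heap.Push(temp, c) // keep the chunk for a later pass
			return
		}
		heap.Push(uh, c)
	}
}
```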
TODO - MR Breakout
General Order:
- Update heap building code to check cached health values of files, and use the piece info already calculated so that the `offline` and `goodForUpload` maps are not needed !3678 (merged)
- Add logic to push directories back onto the `directoryHeap`. Implement the `update` method of the `Heap` interface !3679 (merged)
- Update `directoryHeap` per #3558 (closed)
- Add a size limit to the `uploadHeap` and add folders back into the `directoryHeap` if not all chunks were added. Directory health should be updated with the health of the next chunk that would have been added. For uploads, if not all chunks can be added to the `uploadHeap`, then the directory will be added to the heap. With this we should build a temp chunk heap in order to add chunks to the `uploadHeap`, to ensure that we are adding the worst chunks from a directory.
- Add logic to add more directories if the `uploadHeap` is not full !3688 (merged)
  - Potentially not needed after the `directoryHeap` changes. Will need to re-evaluate.
- Add logic to check `directoryHeap` health against `uploadHeap` health and break out of the repair loop to add more in-need chunks. The `uploadHeap` will be reset to avoid bloat. (Need to think through edge cases that would cause low throughput and high cycling)
- Add logic to check `directoryHeap` health against the health of chunks being added to the `uploadHeap`. (Need to think through edge cases that would cause low throughput and high cycling)
MRs not needing any specific order:
- Select stuck files weighted by number of stuck chunks #3559 (closed) !3693 (merged)
- Update in-code docstrings and renter module README !3563 (merged)