Repair Loop Updates
FEATURE REQUEST
Description of Request
Update the repair code from adding chunks by traversing the directory tree to a dual heap system.
Reason or Need for Feature
Adding chunks to the repair heap by finding low health directories did not create enough back pressure, causing non-optimal use of upload bandwidth. Uploads and repairs would see spikes and drop-offs in bandwidth usage even when there were plenty of chunks to be uploaded.
Design / Proposal
Health Loop Changes
No updates to the health loop are currently needed.
Repair Loop
The repair loop will be updated to work off of two heaps, the current chunk heap and a directory heap.
Chunk Heap
No changes to the structure of the chunk heap are needed. The chunk heap will still prioritize by health and by stuck status. The one control flow change is that when the chunk heap drops below 20 chunks in size, it will trigger a request for more chunks.
Directory Heap
The directory heap is a new structure. A dir heap element will look like the following:
    type DirElem struct {
        health   float64
        dir      string
        explored bool
    }
Like the chunk heap, the dir heap will be sorted by health. The explored bool indicates whether the directory has been popped off the heap in order to find its sub directories. When an unexplored directory is popped off the dir heap, explored will be set to true and it will be added back to the heap along with an element for each of its sub directories, with their explored fields set to false.
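The exploration step described above can be sketched with Go's container/heap package. This is a minimal illustration, not the renter's actual code: dirHeap, newDirHeap, subDirsFunc and nextExploredDir are hypothetical names, and the ordering assumes that a larger health value means a less healthy directory.

```go
package main

import "container/heap"

// DirElem mirrors the element described above.
type DirElem struct {
	health   float64
	dir      string
	explored bool
}

// dirHeap implements heap.Interface. Assumption: a larger health value means
// a less healthy directory, so the worst directory sits at the top.
type dirHeap []*DirElem

func (h dirHeap) Len() int            { return len(h) }
func (h dirHeap) Less(i, j int) bool  { return h[i].health > h[j].health }
func (h dirHeap) Swap(i, j int)       { h[i], h[j] = h[j], h[i] }
func (h *dirHeap) Push(x interface{}) { *h = append(*h, x.(*DirElem)) }
func (h *dirHeap) Pop() interface{} {
	old := *h
	last := old[len(old)-1]
	*h = old[:len(old)-1]
	return last
}

// newDirHeap (hypothetical helper) seeds the heap with an unexplored root.
func newDirHeap(root DirElem) *dirHeap {
	h := &dirHeap{}
	heap.Push(h, &root)
	return h
}

// subDirsFunc is a hypothetical lookup that returns the sub directories of
// dir together with their health.
type subDirsFunc func(dir string) []DirElem

// nextExploredDir pops directories until it finds an explored one. An
// unexplored directory is marked explored and pushed back, along with one
// unexplored element per sub directory. It returns nil once the heap is empty.
func nextExploredDir(h *dirHeap, subDirs subDirsFunc) *DirElem {
	for h.Len() > 0 {
		d := heap.Pop(h).(*DirElem)
		if d.explored {
			return d
		}
		d.explored = true
		heap.Push(h, d)
		for _, s := range subDirs(d.dir) {
			s := s // copy so each heap element has its own address
			s.explored = false
			heap.Push(h, &s)
		}
	}
	return nil
}
```

Because every directory is pushed back at most once as explored, the loop terminates after each directory in the tree has been visited.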
When the chunk heap needs more chunks, it will pop a dir off the dir heap and, if explored is true, add the chunks from that directory to the chunk heap. For large directories, as many chunks as possible will be added and then the dir will be added back to the dir heap with its health set to the health of the next chunk to be added. This means that the chunks will need to be sorted in a temp heap so that they are added to the chunk heap in the right order.
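A rough sketch of the temp heap step follows. The chunk type, chunkHeap, popWorst, addDirChunks and the limit parameter are hypothetical stand-ins, and as before a larger health value is assumed to mean a worse chunk. The function returns the health that the directory would be re-added with, plus a flag saying whether any chunks were left over.

```go
package main

import "container/heap"

// chunk is a hypothetical stand-in for a chunk that needs repair.
type chunk struct {
	id     string
	health float64
}

// chunkHeap is a max-heap on health (assumption: larger health is worse).
type chunkHeap []chunk

func (h chunkHeap) Len() int            { return len(h) }
func (h chunkHeap) Less(i, j int) bool  { return h[i].health > h[j].health }
func (h chunkHeap) Swap(i, j int)       { h[i], h[j] = h[j], h[i] }
func (h *chunkHeap) Push(x interface{}) { *h = append(*h, x.(chunk)) }
func (h *chunkHeap) Pop() interface{} {
	old := *h
	last := old[len(old)-1]
	*h = old[:len(old)-1]
	return last
}

// popWorst removes and returns the least healthy chunk.
func popWorst(h *chunkHeap) chunk { return heap.Pop(h).(chunk) }

// addDirChunks sorts a directory's chunks in a temp heap and moves at most
// limit of the worst ones into the repair heap. If chunks remain, it returns
// the health of the next chunk that would have been added, which is the
// health the directory should carry when it is pushed back onto the dir heap.
func addDirChunks(repair *chunkHeap, dirChunks []chunk, limit int) (float64, bool) {
	tmp := make(chunkHeap, len(dirChunks))
	copy(tmp, dirChunks)
	heap.Init(&tmp)
	for added := 0; added < limit && tmp.Len() > 0; added++ {
		heap.Push(repair, heap.Pop(&tmp))
	}
	if tmp.Len() == 0 {
		return 0, false // directory fully consumed, nothing to re-add
	}
	return tmp[0].health, true
}
```

For a directory with chunk healths 1, 5, 3 and 2 and a limit of 2, the chunks with health 5 and 3 move to the repair heap and the directory would be re-added with health 2.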
Stuck Loop
The stuck loop will no longer call and block on the repair of the chunks; it will simply add stuck chunks as triggered. The stuck loop flow will be:
- Add stuck chunks when a stuck file is found
- Add additional stuck chunks if a stuck chunk was successfully repaired
- Add stuck chunks every 10 minutes if unsuccessful
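One way to picture this non-blocking flow is a select loop over the three triggers. The sketch below is only an illustration under assumptions: the found and repaired channels, the stop channel and the addStuck callback are hypothetical, and the timer restarts on every iteration so the 10 minute wait only fires when no other trigger arrives first.

```go
package main

import "time"

// retryInterval is the fallback wait from the list above.
const retryInterval = 10 * time.Minute

// stuckLoop is a sketch of the flow above. found, repaired and addStuck are
// hypothetical names: the two channels signal the first two triggers, and
// addStuck pushes stuck chunks onto the chunk heap without waiting for them
// to be repaired.
func stuckLoop(found, repaired, stop <-chan struct{}, retry time.Duration, addStuck func(trigger string)) {
	for {
		// A fresh timer per iteration: the retry only fires if nothing
		// else triggered within the interval.
		timer := time.NewTimer(retry)
		select {
		case <-found:
			addStuck("stuck file found")
		case <-repaired:
			addStuck("stuck chunk repaired")
		case <-timer.C:
			addStuck("retry interval elapsed")
		case <-stop:
			timer.Stop()
			return
		}
		timer.Stop()
	}
}
```

The loop never waits on the repair itself; it only reacts to signals and hands chunks to the heap.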
TODOs
Any remaining TODOs have been moved to #3554 (closed) to allow for further discussion on the next round of changes.
- Add control flow to the repair heap so that when it gets low, it either grabs more chunks or it blocks until stuck chunks / low health files / new files are added; create min constant. !3559 (merged)
- Remove managedRepairLoop call from the stuck loop. !3551 (closed)
- Increase number of stuck chunks added to the heap. !3521 (merged)
- Create the directory heap. !3557 (merged)
- Incorporate directory heap into repair logic. !3595 (merged)
  - Start by adding the full directory at a time and don't worry about adding too many chunks at once (initially).
  - Add the logic to decide when to pull the next directory out of the directory heap.
  - Add the logic to decide when to reset the directory heap vs. wait for new files. Probably that logic will be "always reset the directory heap, if we reset it and root is healthy, then we wait".
- Remove call to bubbleMetadata from workers and just call from end of Repair Loop iteration to reduce disk I/O. !3654 (merged)
- Add a filter so that it tracks the 100 worst chunks that are in the heap, and only adds a chunk from the dir if it's got a health worse than the current 100 worst chunks. This may be a bit tricky to implement so we should talk a bit more. This involves creating a temp heap to sort the chunks to be added. !3659 (merged)
- Add logic to go fetch more directories if the health of the worst chunk in the heap is better than the health of the worst directory in the directory tree.
- Add logic to add folders back into the directory heap if not all of their chunks get used. They should be added with a health equal to the best health chunk that made it into the repair heap. Create max number of chunks added at a time constant. !3659 (merged)
- Remove RecentRepairTime. !3658 (merged)
- Update in-code docstrings and renter module README
- Expand Testing
  - Repair test should upload a large number of files without blocking, then rename files mid-upload
  - Live Testing
    - multi sized files, multi level directory, rename files
    - one large root directory