Switch snapshot code to use workers
There are a bunch of unresolved comments, but the two biggest things are:
- Switch the backup process to being async, as it can take quite a while to build and restore a snapshot
- Use the existing worker system to do the backup and recovery work, instead of using a session directly
Switching to workers should give us parallelization as well as nice things like exponential backoff.
The following discussions from !3665 (merged) should be addressed:
-
@DavidVorick started a discussion: (+2 comments) My thought is that every 5 minutes or so we scroll through the list of hosts, see who is marked as 'synced' and 'not synced', and only attempt to communicate with the hosts that are currently marked 'not synced'.
If you want to actually do downloads to double check, I think we only need to be doing that once a day or so. I don't expect that hosts which are in sync will suddenly go out of sync even over the course of months of uptime. I do think we should check each time the renter comes online, but I'm not sure we need to be downloading snapshots from the hosts more frequently than just at boot.
The every-5-minutes check would be to try and connect with hosts that are offline or that we had trouble syncing. I'm wondering whether even 5 minutes is too short, because we're going to be marking the host for bad interactions every time they fail. Most of the other code uses exponential backoff to avoid marking the host as bad too often; I wonder if there's some way we can work that in here.
-
@DavidVorick started a discussion: (+3 comments) We will need a test that creates a bunch of snapshots and checks that this boundary code works. I believe a sector size is only 4 KiB in testing, so it shouldn't be that burdensome.
-
@DavidVorick started a discussion: (+5 comments) Should this be a helper method?
-
@DavidVorick started a discussion: (+1 comment) Well, perhaps not. If the error is host-originated, the other snapshots probably aren't going to upload either. We'd probably want to go on cooldown, because otherwise the host will not be marked as synchronized and we're going to keep uploading to it.
-
@DavidVorick started a discussion: And log.Println
-
@DavidVorick started a discussion: (+1 comment) One issue with going one host at a time is that we are constantly re-downloading a snapshot for each host that is missing a snapshot. We should make sure things are coded so that if we find a new snapshot, we add it to all of the hosts at once and minimize the amount of downloading that we have to do.
We should probably log the number of hosts that aren't synchronized in this loop.
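
The last discussion above (download each snapshot once, push it to every host that lacks it, and log how many hosts are out of sync) could be sketched roughly as follows. This is only an illustration under assumptions: `hostState`, `planUploads`, `unsyncedCount`, and `syncPass` are hypothetical names, and the `fetch` and `upload` callbacks stand in for whatever the workers would actually do.

```go
package main

import (
	"fmt"
	"log"
	"sort"
)

// hostState is a hypothetical record of which snapshots a host is known to hold.
type hostState struct {
	name string
	has  map[string]bool
}

// planUploads groups hosts by the snapshot they are missing, so that each
// snapshot is fetched once and then sent to every host that lacks it.
func planUploads(known []string, hosts []hostState) map[string][]string {
	plan := make(map[string][]string)
	for _, h := range hosts {
		for _, snap := range known {
			if !h.has[snap] {
				plan[snap] = append(plan[snap], h.name)
			}
		}
	}
	return plan
}

// unsyncedCount returns the number of hosts missing at least one known snapshot.
func unsyncedCount(known []string, hosts []hostState) int {
	n := 0
	for _, h := range hosts {
		for _, snap := range known {
			if !h.has[snap] {
				n++
				break
			}
		}
	}
	return n
}

// syncPass logs the number of unsynchronized hosts, then fetches each missing
// snapshot once and uploads it to all hosts that lack it. It returns the
// number of successful fetches.
func syncPass(
	known []string,
	hosts []hostState,
	fetch func(snap string) ([]byte, error),
	upload func(host, snap string, data []byte) error,
) int {
	log.Printf("%d of %d hosts are not synchronized", unsyncedCount(known, hosts), len(hosts))

	plan := planUploads(known, hosts)
	snaps := make([]string, 0, len(plan))
	for snap := range plan {
		snaps = append(snaps, snap)
	}
	sort.Strings(snaps) // deterministic order for logs and tests

	fetches := 0
	for _, snap := range snaps {
		data, err := fetch(snap)
		if err != nil {
			log.Printf("fetching snapshot %s: %v", snap, err)
			continue
		}
		fetches++
		for _, host := range plan[snap] {
			if err := upload(host, snap, data); err != nil {
				log.Printf("uploading snapshot %s to %s: %v", snap, host, err)
			}
		}
	}
	return fetches
}

func main() {
	known := []string{"s1", "s2"}
	hosts := []hostState{
		{name: "a", has: map[string]bool{"s1": true}},
		{name: "b", has: map[string]bool{}},
		{name: "c", has: map[string]bool{}},
	}
	fetch := func(snap string) ([]byte, error) { return []byte(snap), nil }
	upload := func(host, snap string, data []byte) error { return nil }
	fmt.Println("snapshots fetched:", syncPass(known, hosts, fetch, upload))
}
```

The point of building the plan first is that the number of downloads scales with the number of distinct missing snapshots rather than with the number of hosts.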