Upload Purging Algorithm Could Result in Unacceptably High Memory Usage for Very Large Registries
Context
Current Algorithm
The repository side of the filesystem is walked, and all subdirectories within the _uploads
directory (containing information on partial uploads) are gathered into a single map of structs, each containing the path and a time representing the start of the upload. That map is then iterated over; if an upload is determined to be old enough to be deleted, it is deleted and its path is appended to a slice, which is returned. This result is currently used only for testing.
Issues
The largest issue is that all candidates for deletion are gathered and stored in memory before deletion occurs. For large registries, this could result in several GiB held in memory for the entire duration of a sweep.
The result slice could also lead to memory usage problems; however, it will typically represent a subset of all candidates for deletion and does not carry timestamp information.
Additionally, it should be possible to determine whether an upload directory can be deleted at the point at which it is encountered during the walk, both reducing the size of the set of deletion candidates and removing the need to store timestamp data alongside each path.
Possible Solutions
- Process results in place: Since we should be able to evaluate whether to keep a directory at the time it is encountered, we could delete it at that time as well. This removes the need to store candidate information at all. However, it would break many of the current tests for the upload purger, so although it results in the simplest code, effort will have to be spent rewriting the tests.
- Pipelines: Deletion candidates could be selected in one goroutine and sent to another, which deletes them. This would reduce the number of "in flight" candidates and allow us to adapt the existing test suite far more easily. The results could similarly be sent to a channel, allowing us to evaluate them in tests and discard them when the purger is run from the application.
- Database: With the metadata database, it would be possible to purge uploads on a per-repository basis, rather than walking the entire metadata half of the filesystem. This would reduce the purging operation's issues of scale considerably, although it would allow minimal reuse of existing code.