Skip to content

Resolve "Optimize the s3 storage driver's Walk function"

Allow the s3 driver's walk function to run the WalkFn in parallel.

Standard concurrently techniques were difficult to apply in this instance due to the recursive nature of Walk, along with its employment of a callback function.

In the event than an error is encountered WalkFns that have not yet been called will not be called and new goroutines will be be spawned. This is accomplished by adapting a technique found on the Go Blog Under Explicit Cancellation:

This approach has a problem: each downstream receiver needs to know the number of potentially blocked upstream senders and arrange to signal those senders on early return. Keeping track of these counts is tedious and error-prone.

We need a way to tell an unknown and unbounded number of goroutines to stop sending their values downstream. In Go, we can do this by closing a channel, because a receive operation on a closed channel can always proceed immediately, yielding the element type's zero value.

This means that main can unblock all the senders simply by closing the done channel. This close is effectively a broadcast signal to the senders. We extend each of our pipeline functions to accept done as a parameter and arrange for the close to happen via a defer statement, so that all return paths from main will signal the pipeline stages to exit.

Changes Needed to Accommodate this MR

The changes in this MR require that all functions passed to the storage driver's Walk method (and derivative functions) must be thread safe. The following changes should be merged and these issues resolved in before this one to insure the stability of the registry:

!26 (merged)

#4 (closed)
#5 (closed)
#6 (closed)
#7 (closed)
#8 (closed)
#9 (closed)

Current Behavior

The walk behavior of the walk function is mostly accomplished through an non-exported doWalk function, this function calls the AWS SDK's ListObjectsV2PagesWithContext which calls the provided callback function on each S3 object for (potentially multiple) paginated requests. Requests are limited by a provided prefix. These objects are then inspected for information about the files and directories they represent, appended to a slice, and iterated over.

For each iteration the callback function is called on the file.

On error, pagination is stopped and the error reported to Walk.

If the file is a directory, doWalk is called recursively on this directory, using this directory's prefix in order to descend down the filesystem from this point.

For each page of results, all directories and their subdirectories must be fully walked and each file in the request must be evaluated before the next request is made.

New Behavior

Within the doWalk function, every object from a page of results from ListObjectsV2PagesWithContext is evaluated within it's own goroutine, the next page of results is requested as soon as each goroutine has been spawned.

doWalk has been updated to run in a thread-safe manner. If the file is a directory, doWalk will be spawned in a new goroutine, and the parent doWalk is free to exit before its child goroutines it spawns resolve.

Walk will wait for all goroutines spawned by doWalk and its recursive calls to resolve before returning. In this way, calls to Walk remain synchronous, with the exception that the callback function must be thread-safe.

Errors are reported on an error channel. As soon as the first error is encountered, new goroutines exit immediately and doWalk calls request no further pages. This error is ultimately reported by the Walk function.

Edited by Hayley Swimelar

Merge request reports