Add WalkParallel method to storage driver interface

Overview

Because paginated API calls depend on Walk returning results in a stable order, we have decided to handle parallelized walks via a new method on the storagedriver interface. See: !37 (comment 268823080)

This lets callers choose the walk behavior themselves. Specifically, they can use the faster WalkParallel when the following conditions hold:

  1. Their WalkFn callback is thread-safe
  2. They intend to traverse all (or a large subset) of the files under the given path
  3. Processing order is not important
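
As a rough illustration of these conditions, the sketch below shows a callback that is safe to call from several goroutines at once and whose result does not depend on processing order. The FileInfo and WalkFn types and the walkParallel helper are simplified, hypothetical stand-ins, not the actual storagedriver definitions, which may differ in signature.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// FileInfo and WalkFn are simplified stand-ins for the storagedriver types
// (assumption: the real signatures may differ).
type FileInfo struct {
	Path  string
	IsDir bool
}

type WalkFn func(fi FileInfo) error

// walkParallel is a hypothetical stand-in for a driver's WalkParallel.
// It calls f for each entry from a separate goroutine, in no fixed order,
// and returns the first error it happens to record.
func walkParallel(entries []FileInfo, f WalkFn) error {
	var (
		wg       sync.WaitGroup
		mu       sync.Mutex
		firstErr error
	)
	for _, fi := range entries {
		wg.Add(1)
		go func(fi FileInfo) {
			defer wg.Done()
			if err := f(fi); err != nil {
				mu.Lock()
				if firstErr == nil {
					firstErr = err
				}
				mu.Unlock()
			}
		}(fi)
	}
	wg.Wait()
	return firstErr
}

// collectPaths gathers file paths using a mutex-guarded callback
// (condition 1) and sorts afterwards, so the result does not rely on the
// order in which the callback ran (condition 3).
func collectPaths(entries []FileInfo) ([]string, error) {
	var (
		mu    sync.Mutex
		paths []string
	)
	err := walkParallel(entries, func(fi FileInfo) error {
		if fi.IsDir {
			return nil
		}
		mu.Lock()
		defer mu.Unlock()
		paths = append(paths, fi.Path)
		return nil
	})
	sort.Strings(paths)
	return paths, err
}

func main() {
	entries := []FileInfo{
		{Path: "/repo/b"},
		{Path: "/repo", IsDir: true},
		{Path: "/repo/a"},
	}
	paths, err := collectPaths(entries)
	fmt.Println(paths, err)
}
```

Without the mutex, concurrent appends to the shared slice would be a data race, which is the kind of problem the thread-safety condition is meant to rule out.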

This merge request includes changes that:

  • Add the WalkParallel method to the storagedriver interface and implement it on all drivers except s3. For s3, the original Walk method is added back.
  • Add thread-safety to the catalog_test Walk function call
  • Call the new WalkParallel method for non-paginated walks
  • Fix a flaw in doWalkParallel where directory-skipping error flags were incorrectly reported up to callers
  • Fix an incorrect regex in the driver testsuites

Potential weaknesses

This MR updates all storage drivers to use parallel walks for WalkParallel. It might be prudent to have production drivers, such as GCS and swift, fall back to sequential Walks until we can verify their behavior when running highly parallelized workloads.

To do this, we would need to find a workaround for TestWalkParallelStopsProcessingOnError in the driver testsuites, which will time out for non-parallel Walks.
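
One way the fallback discussed above could look is a WalkParallel that simply delegates to the sequential Walk. This is only a sketch under assumptions: exampleDriver, FileInfo, and WalkFn are hypothetical, simplified stand-ins rather than the real driver types, and the in-memory file list exists purely for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// FileInfo and WalkFn are simplified stand-ins for the storagedriver types
// (assumption: the real signatures may differ).
type FileInfo struct{ Path string }

type WalkFn func(fi FileInfo) error

// exampleDriver is a hypothetical in-memory driver used only for illustration.
type exampleDriver struct {
	files []string
}

// Walk visits every file under path sequentially, in slice order,
// and stops at the first error returned by f.
func (d *exampleDriver) Walk(path string, f WalkFn) error {
	for _, p := range d.files {
		if !strings.HasPrefix(p, path) {
			continue
		}
		if err := f(FileInfo{Path: p}); err != nil {
			return err
		}
	}
	return nil
}

// WalkParallel satisfies the new method but falls back to the sequential
// Walk until parallel behavior has been verified for this driver.
func (d *exampleDriver) WalkParallel(path string, f WalkFn) error {
	return d.Walk(path, f)
}

func main() {
	d := &exampleDriver{files: []string{"/a/1", "/a/2", "/b/1"}}
	err := d.WalkParallel("/a", func(fi FileInfo) error {
		fmt.Println(fi.Path)
		return nil
	})
	fmt.Println(err)
}
```

Callers would not need to change with this approach, since a callback that is safe for parallel use is also safe when invoked sequentially.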

Edited by Hayley Swimelar