
Ensure that duplicates are not processed in CAS batch methods

Rohit Kothur requested to merge rkothur/unique-digests into master

Description

This MR addresses two distinct but related problems with regard to duplicate digests.

The first problem is that duplicate digests in batch requests should not be processed wastefully: if a digest appears more than once, it should receive only a single response. Clients that send duplicates need to be aware of this behavior and match responses by digest rather than by position in the response list (note that this is already the case for BatchReadBlobs). Before this MR, duplicate blobs would also cause BuildGrid with the Postgres-backed index to raise an error, because Postgres's upsert rejects duplicate keys within a single statement.
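To make that failure mode concrete, here is a minimal sketch of a Postgres upsert via SQLAlchemy; the table, columns, and connection string are illustrative placeholders, not BuildGrid's actual schema:

```python
from sqlalchemy import Column, MetaData, String, Table, create_engine
from sqlalchemy.dialects.postgresql import insert

metadata = MetaData()
index_entries = Table(
    "index_entries",
    metadata,
    Column("digest_hash", String, primary_key=True),
)

engine = create_engine("postgresql://user:pass@localhost/bgd")  # placeholder DSN
metadata.create_all(engine)

# The same digest appears twice in a single batch of rows.
rows = [{"digest_hash": "abc123"}, {"digest_hash": "abc123"}]

stmt = insert(index_entries).values(rows)
stmt = stmt.on_conflict_do_update(
    index_elements=["digest_hash"],
    set_={"digest_hash": stmt.excluded.digest_hash},
)

with engine.connect() as conn:
    # Postgres rejects this with "ON CONFLICT DO UPDATE command cannot
    # affect row a second time": two rows in one statement may not
    # target the same constrained value.
    conn.execute(stmt)
```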

We generally shouldn't need to worry about duplicates in these requests regardless of the backend, so this MR adds logic at the service level to handle the problem. It introduces a helper function, get_unique_objects_by_attribute, which filters a list of objects based on the uniqueness of some attribute of each object. This is needed because Digest objects are mutable and therefore not hashable, so we can't simply convert the input list of Digests to a set. We use this function to sanitize the lists of digests in the inputs of BatchReadBlobs and FindMissingBlobs, and to sanitize the list of blob/digest requests in BatchUpdateBlobs.
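The helper might look roughly like the following; this is a hedged sketch reconstructed from the description above, not the exact implementation in the MR:

```python
from operator import attrgetter


def get_unique_objects_by_attribute(objects, attribute):
    """Yield each object whose value for `attribute` has not been seen yet.

    Input order is preserved and the first occurrence of each value wins.
    attrgetter supports dotted paths (e.g. "digest.hash"), which is handy
    when the digest is nested, as in BatchUpdateBlobs requests.
    """
    getter = attrgetter(attribute)
    seen = set()
    for obj in objects:
        value = getter(obj)
        if value not in seen:
            seen.add(value)
            yield obj
```

A BatchReadBlobs handler could then sanitize its input with something like `list(get_unique_objects_by_attribute(request.digests, "hash"))` (the attribute name here is an assumption for illustration).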

The second problem is that, in the SQL backend, there are currently two methods that add items to the SQL store: _save_digests_to_index and _add_index_entry. The _save_digests_to_index method was updated in another MR to provide better input sanitization, and on Postgres in particular it delegates to an upsert. The _add_index_entry method has neither of these properties and is therefore unsafe. It was called by the ByteStream Write and Read methods, meaning that adding a blob a second time via ByteStream Write would fail. This MR removes _add_index_entry and updates its callers to use _save_digests_to_index instead.
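Schematically, the difference in behavior looks like the toy sketch below; only the two method names come from this MR, and the Digest stand-in and surrounding class are invented for illustration:

```python
from collections import namedtuple

# Toy stand-in for the protobuf Digest message, just for this sketch.
Digest = namedtuple("Digest", ["hash", "size_bytes"])


class IndexSketch:
    """Illustrates the two index methods' semantics; not BuildGrid's code."""

    def __init__(self):
        self._entries = {}

    def _add_index_entry(self, digest):
        # The removed method: a bare insert, so indexing the same blob
        # twice fails -- this is what broke repeated ByteStream Writes.
        if digest.hash in self._entries:
            raise RuntimeError(f"duplicate index entry: {digest.hash}")
        self._entries[digest.hash] = digest

    def _save_digests_to_index(self, digests):
        # The surviving method: upsert semantics, so re-adding an
        # already-indexed blob is a harmless update.
        for digest in digests:
            self._entries[digest.hash] = digest


index = IndexSketch()
blob = Digest(hash="abc123", size_bytes=42)
index._save_digests_to_index([blob])
index._save_digests_to_index([blob])  # fine: the upsert tolerates repeats
index._add_index_entry(blob)          # raises: the unsafe, removed path
```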

Validation

In addition to running the unit tests, you can use the stress-testing docker-compose files with Postgres to validate this MR. On master, you should see several errors complaining about duplicate blobs during ON CONFLICT DO UPDATE. On commit b1797475 of this MR, those errors are gone but are replaced with errors complaining about duplicate digests in general. On commit 66c3f50a, all of these errors should be gone.

The command I used to test is:

STRESS_TESTING_BGD_CONFIG=bgd-pges-indexedcas.yml \
STRESS_TESTING_GRACE_PERIOD=5 \
STRESS_TESTING_CLIENT_REQUESTS=2 \
docker-compose -f pges-recc-bbrht.yml up \
    --scale buildgrid=1 --scale bots=10 \
    --scale clients=10 --scale database=1