Indexed CAS

Rohit Kothur requested to merge rkothur/index into master

Description

This MR adds support for an index to CAS. Currently only a SQL-based implementation is provided.

Changes proposed in this merge request:

  • Add a new interface, index_abc.py, which is an extension of storage_abc.py.
  • Add a SQL implementation for the index, as well as models for the database.
  • Tests
  • Configuration support is included in !236 (merged)

Current limitations

  • The SQL index currently supports only a single storage backend. Multilevel storage is still possible through the with_cache storage implementation, but for more flexibility in the future we may want the index to support multiple storages.
  • As a result of the above, the table schema does not currently have a "location" field; it only tracks the digest hash, the size, and the last updated time.
  • The missing blobs query originally failed if it was given too many blobs. SQL implementations limit the number of bound parameters per query, with SQLite defaulting to 999 and PostgreSQL defaulting to (I think) 32767, and there are also limits on query length. The SQL index therefore needs to know which SQL implementation it is talking to, and it needs to split large FindMissingBlobs requests into multiple SQL queries that stay within that implementation's limit. I had planned to defer this until the YAML parsing was implemented, where a field could specify the implementation, but it has now been handled in this MR.
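
For reviewers, here is a minimal sketch of the batching idea described in the last bullet, written against the standard-library sqlite3 module rather than the code in this MR. The table name `index_entries`, its column names, the helper names, and the `MAX_BOUND_PARAMS` constant are all hypothetical and chosen only for illustration; the real schema and the per-implementation limit lookup may differ.

```python
import sqlite3
from typing import Iterable, List

# Assumption: a conservative per-query cap on bound parameters
# (999 is the SQLite default mentioned above).
MAX_BOUND_PARAMS = 999


def create_index_table(conn: sqlite3.Connection) -> None:
    """Create a hypothetical index table: hash, size, last-updated time."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS index_entries ("
        " digest_hash TEXT PRIMARY KEY,"
        " digest_size_bytes INTEGER NOT NULL,"
        " last_updated TEXT NOT NULL)"
    )


def find_missing_hashes(conn: sqlite3.Connection,
                        hashes: Iterable[str],
                        batch_size: int = MAX_BOUND_PARAMS) -> List[str]:
    """Return the requested hashes that have no row in the index table."""
    # De-duplicate while keeping the caller's order.
    requested = list(dict.fromkeys(hashes))
    found = set()
    # Issue one SELECT per batch so no query exceeds the parameter cap.
    for start in range(0, len(requested), batch_size):
        batch = requested[start:start + batch_size]
        placeholders = ",".join("?" * len(batch))
        rows = conn.execute(
            "SELECT digest_hash FROM index_entries "
            f"WHERE digest_hash IN ({placeholders})",
            batch,
        )
        found.update(row[0] for row in rows)
    return [h for h in requested if h not in found]
```

A caller would pass the digest hashes from a FindMissingBlobs request and choose `batch_size` from the configured SQL implementation. This sketch only addresses the bound-parameter limit; query-length limits would need a separate check.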

TODO

  • More tests
  • SQL implementation of BatchUpdateBlobs and BatchReadBlobs.
  • Allow this to handle a large number of blobs.
  • Performance testing

This merge request, when merged, will address issue/bug:

#181 (closed)

Edited by Rohit Kothur