As part of the CAS cleanup effort, and to speed up FindMissingBlobs(), we should add an index to BuildGrid CAS. See the linked mailing list post for details on how the index would be useful for cleanup.
One small note: in the ML thread, I raised this question:
One question that I haven’t figured out the answer to is whether the index should sit in between the CAS server and the CAS backend or whether it should sit off to the side. In other words, should the CAS server "consult" the index for locations of blobs, whether blobs exist, etc. and then talk to the CAS backend; or should it treat the index as a CAS implementation and just forward all of the requests to the index, and let the index do the communication with the CAS backend? I think both are valid approaches.
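I think it would be best to have the index sit between the CAS server and the backend (the second option above: the CAS server forwards all requests to the index, and the index talks to the backend), since that reduces the complexity of changes to the CAS server.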
Define the interface for the index layer
- This could be very similar to the current interface between the CAS server and the storage backend. It might need a couple of additions, such as separate functions to delete entries from the index and the backend.
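To make the shape of such an interface concrete, here is a minimal sketch. It is not BuildGrid's actual API: the class and method names (`Storage`, `Index`, `delete_index_entry`, and so on) are hypothetical, digests are simplified to plain strings, and the in-memory classes exist only to show the index sitting between the server and a backend.

```python
from abc import ABC, abstractmethod
from typing import Dict, Iterable, List, Optional, Set


class Storage(ABC):
    """Hypothetical stand-in for the existing server-to-backend interface."""

    @abstractmethod
    def get_blob(self, digest: str) -> Optional[bytes]: ...

    @abstractmethod
    def put_blob(self, digest: str, data: bytes) -> None: ...

    @abstractmethod
    def delete_blob(self, digest: str) -> None: ...

    @abstractmethod
    def missing_blobs(self, digests: Iterable[str]) -> List[str]: ...


class Index(Storage):
    """Same surface as Storage, plus a way to drop only the index entry."""

    @abstractmethod
    def delete_index_entry(self, digest: str) -> None: ...


class DictStorage(Storage):
    """Toy in-memory backend, used only for this illustration."""

    def __init__(self) -> None:
        self.blobs: Dict[str, bytes] = {}

    def get_blob(self, digest: str) -> Optional[bytes]:
        return self.blobs.get(digest)

    def put_blob(self, digest: str, data: bytes) -> None:
        self.blobs[digest] = data

    def delete_blob(self, digest: str) -> None:
        self.blobs.pop(digest, None)

    def missing_blobs(self, digests: Iterable[str]) -> List[str]:
        return [d for d in digests if d not in self.blobs]


class DictIndex(Index):
    """Toy index that sits between the CAS server and a backend."""

    def __init__(self, backend: Storage) -> None:
        self._backend = backend
        self._known: Set[str] = set()

    def get_blob(self, digest: str) -> Optional[bytes]:
        return self._backend.get_blob(digest)

    def put_blob(self, digest: str, data: bytes) -> None:
        # Write to the backend first, then record the digest in the index.
        self._backend.put_blob(digest, data)
        self._known.add(digest)

    def delete_blob(self, digest: str) -> None:
        # Remove both the index entry and the blob in the backend.
        self._known.discard(digest)
        self._backend.delete_blob(digest)

    def delete_index_entry(self, digest: str) -> None:
        # Remove only the index entry; the backend blob is left in place.
        self._known.discard(digest)

    def missing_blobs(self, digests: Iterable[str]) -> List[str]:
        # Answered from the index alone; the backend is never consulted.
        return [d for d in digests if d not in self._known]
```

Because `Index` extends `Storage`, the CAS server could be handed a `DictIndex(DictStorage())` wherever it currently takes a storage backend, which is the property that keeps server-side changes small.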
Provide a preliminary implementation for an index in SQL
- Keep cleanup in mind when designing the schema (see the "Functions" section of the ML post)
- We can use SQLAlchemy for the database interface since the plan is to use that for bounceability as well
- Add YAML configuration support for selection of the index layer
- Make FindMissingBlobs only reach out to the index and not the backend (this might be trivial)
- Provide a SQLite implementation that works "out of the box": a user should be able to specify just a path to a database file on disk, and the index should create the database file and tables if necessary
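The "out of the box" behaviour and the index-only FindMissingBlobs lookup could look roughly like the sketch below. It deliberately uses the standard-library `sqlite3` module instead of SQLAlchemy so that it runs without extra dependencies; a real implementation would follow the SQLAlchemy plan above. The table name, column names, and the `accessed_timestamp` column (a guess at what cleanup might need) are assumptions, not the documented schema.

```python
import sqlite3
import time
from typing import Iterable, List, Tuple

# Hypothetical simplification of a digest: (hash, size_bytes).
Digest = Tuple[str, int]

# Assumed schema; accessed_timestamp is included with cleanup in mind.
SCHEMA = """
CREATE TABLE IF NOT EXISTS index_entries (
    digest_hash        TEXT PRIMARY KEY,
    digest_size_bytes  INTEGER NOT NULL,
    accessed_timestamp REAL NOT NULL
)
"""


class SQLiteIndex:
    def __init__(self, path: str) -> None:
        # sqlite3.connect() creates the database file if it does not exist,
        # and CREATE TABLE IF NOT EXISTS makes table creation idempotent.
        self._conn = sqlite3.connect(path)
        with self._conn:
            self._conn.execute(SCHEMA)

    def add(self, digest: Digest) -> None:
        # Insert the entry, or refresh its timestamp if already present.
        with self._conn:
            self._conn.execute(
                "INSERT OR REPLACE INTO index_entries VALUES (?, ?, ?)",
                (digest[0], digest[1], time.time()),
            )

    def missing_blobs(self, digests: Iterable[Digest]) -> List[Digest]:
        # Answer from the index only; the storage backend is not contacted.
        wanted = list(digests)
        if not wanted:
            return []
        placeholders = ",".join("?" for _ in wanted)
        rows = self._conn.execute(
            "SELECT digest_hash FROM index_entries "
            f"WHERE digest_hash IN ({placeholders})",
            [d[0] for d in wanted],
        )
        present = {row[0] for row in rows}
        return [d for d in wanted if d[0] not in present]

    def least_recently_used(self, limit: int) -> List[Digest]:
        # Candidate blobs for cleanup, oldest access first.
        rows = self._conn.execute(
            "SELECT digest_hash, digest_size_bytes FROM index_entries "
            "ORDER BY accessed_timestamp ASC LIMIT ?",
            (limit,),
        )
        return [(digest_hash, size) for digest_hash, size in rows]
```

A user would only supply a path, for example `SQLiteIndex("/path/to/index.db")`. Note that SQLite caps the number of bound parameters per statement, so very large FindMissingBlobs requests would need to be split into batches.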
All of the above items are complete with tests where appropriate. The index (and in particular, the SQL schema) has been documented.