Should we use a CAS-based source cache as a mirror?
I have been thinking about how the CAS store outlined in Jürg's Remote Execution proposal (https://mail.gnome.org/archives/buildstream-list/2018-April/msg00006.html) could be used as a mirroring solution.
DISCLAIMER: I haven't worked with gRPC or CAS before, so I may have overlooked some important implementation details.
Context
CAS right now
CAS (ContentAddressableStorage) stores file trees, each of which can be retrieved with a key derived from the contents of that tree, like a checksum. Jürg's Remote Execution proposal comes with a demonstration branch, https://gitlab.com/BuildStream/buildstream/tree/juerg/googlecas. In that branch, there is a CasCache class which includes a translation layer from an artifact's cache key to the CAS digest of the artifact's contents, so that, given a cache key, it can store file trees in the CAS and retrieve them from it. The branch also comes with a script that functions as a server for uploading/downloading artifacts.
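To make the idea concrete, here is a minimal in-memory sketch of a content-addressed store with a key-to-digest translation layer. It is not the CasCache from the branch: the class name ToyCAS, its methods, and the flat path-to-bytes tree representation are all assumptions made for illustration, and a real CAS hashes individual blobs and directory messages rather than one serialized tree.

```python
import hashlib


class ToyCAS:
    """In-memory content-addressable store for file trees (illustration only)."""

    def __init__(self):
        self._trees = {}  # digest -> file tree (path -> bytes)
        self._refs = {}   # cache key -> digest (the "translation layer")

    @staticmethod
    def _serialize(tree):
        # Sort by path so that equal trees always produce equal bytes.
        parts = []
        for path in sorted(tree):
            content = tree[path]
            parts.append(
                path.encode() + b"\0" + str(len(content)).encode() + b"\0" + content
            )
        return b"".join(parts)

    def put(self, tree):
        """Store a tree and return the digest derived from its contents."""
        digest = hashlib.sha256(self._serialize(tree)).hexdigest()
        self._trees[digest] = dict(tree)
        return digest

    def get(self, digest):
        """Return a copy of the tree stored under this digest."""
        return dict(self._trees[digest])

    def set_ref(self, cache_key, digest):
        """Record which digest a cache key currently points at."""
        self._refs[cache_key] = digest

    def resolve(self, cache_key):
        """Look up a tree by cache key instead of by digest."""
        return self.get(self._refs[cache_key])
```

Storing the same tree twice yields the same digest, so only the key-to-digest mapping can ever change; the stored content itself is immutable.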
CAS and sources
As I understand Jürg's proposal, sources will be stored in a CAS to:
- Allow remote workers to access the sources during build tasks.
- Once the virtual filesystem and FUSE layer are implemented, build faster locally.
A comment by Sander (https://mail.gnome.org/archives/buildstream-list/2018-May/msg00069.html) raised the possibility of using this SourceCache as a mirror as well.
The current mirroring solution
The current mirroring solution has two parts, generic mirrors and Buildstream-generated mirrors:
Generic mirrors
There is an issue outlining the plan for generic mirrors at #328. Briefly, we define a list of mirrors, each of which maps aliases to the URL prefixes that replace them. When fetching, we consult these mirrors before the original alias; when tracking, we consult the mirrors if the original alias is unreachable.
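The lookup order described above can be sketched as follows. This is only an illustration: the function name candidate_urls, the "alias:path" URL form, the dictionary layout of the mirror list, and the example.com hosts are assumptions, not the design from #328.

```python
def candidate_urls(url, aliases, mirrors, tracking=False):
    """Return the URLs to try, in order, for an aliased source URL.

    aliases: alias -> upstream URL prefix
    mirrors: list of dicts, each mapping alias -> mirror URL prefix
    """
    alias, sep, rest = url.partition(":")
    if not sep or alias not in aliases:
        # Not an aliased URL (e.g. a plain https:// URL): nothing to substitute.
        return [url]

    upstream = aliases[alias] + rest
    mirrored = [m[alias] + rest for m in mirrors if alias in m]

    if tracking:
        # Tracking: try upstream first, fall back to mirrors if unreachable.
        return [upstream] + mirrored
    # Fetching: try mirrors first, then the original alias.
    return mirrored + [upstream]


# Hypothetical configuration, for illustration only.
aliases = {"upstream": "https://upstream.example.com/"}
mirrors = [{"upstream": "https://mirror.example.com/upstream/"}]
fetch_order = candidate_urls("upstream:foo.tar.gz", aliases, mirrors)
track_order = candidate_urls("upstream:foo.tar.gz", aliases, mirrors, tracking=True)
```

The caller would walk the returned list and stop at the first URL that responds.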
Buildstream-generated mirrors
Issue #330 covers using buildstream to generate a mirror that other instances of buildstream can use. This involves:
- Adding a mirror method to each source which makes it fetch the entire repository.
- Reworking the way we store fetched sources so that they're in the same format as they were fetched (i.e. fetching http://foo.com/foo.tar.gz would still have foo.tar.gz)
- Incompatible sources are incrementally namespaced, i.e. we store sources in //<mangled_url>//, where n is the incremental namespace, starting at 0 and incrementing whenever we've fetched a source that can't be stored in the same place (e.g. different tar.gz files, or bzr repositories with different histories)
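The incremental namespacing rule in the last item can be sketched like this. The function name pick_namespace and the compatible callback are hypothetical; what counts as "compatible" would depend on the source kind (identical bytes for a tarball, shared history for a repository), and the sketch ignores the on-disk layout entirely.

```python
def pick_namespace(existing, incoming, compatible):
    """Return the namespace index n that the incoming source should use.

    existing:   list of already-stored sources for one mangled URL,
                where the list index is the namespace n (starting at 0)
    incoming:   the newly fetched source
    compatible: callable(stored, incoming) -> bool
    """
    for n, stored in enumerate(existing):
        if compatible(stored, incoming):
            # Can live alongside what is already stored under n.
            return n
    # Nothing compatible: open the next namespace.
    existing.append(incoming)
    return len(existing) - 1


# Tarballs modelled by their checksum; two are compatible only if identical.
stored = []
same_checksum = lambda a, b: a == b
first = pick_namespace(stored, "checksum-a", same_checksum)
again = pick_namespace(stored, "checksum-a", same_checksum)
other = pick_namespace(stored, "checksum-b", same_checksum)
```

Here the first tarball lands in namespace 0, a re-fetch of the same tarball reuses it, and a different tarball served from the same URL opens namespace 1.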
Staying with the planned mirroring solution
Staying with the mirroring solution in #330 has the advantage that it:
- Uses the same behaviour to fetch from every mirror, i.e. point the source at the right URL and call the fetch method.
- Could be used without buildstream
- A future building solution that isn't buildstream could still use buildstream's mirror (or use buildstream to generate the mirror).
- Buildstream does not need to devise and maintain an API to interact with the mirror.
Using a CAS-based mirror
As mentioned above, should the remote execution proposal be accepted, a CAS-based service that serves up sources will exist, whether we intend to use it as a mirror or not.
If we decide to use it as a mirror, though, we have to consider tracking. There are two considerations here:
The result of tracking changes over time
For example, the head of a git branch changes to new commits, and a tar file might be replaced with a different one. A normal caching solution can allow multiple agents to push results to it, since they can assume a given input (e.g. URL and ref) will always correspond to the same output (e.g. a source tree). A mirroring solution that provides tracking information will require that a single agent frequently tracks the upstream repositories and replaces the results in the mirror.
Exposing a track key
Currently there is no way to extract which information is pertinent to tracking for a given source.
I recommend adding a Source.get_track_key() method that every source implements.
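One possible shape for this, as a sketch only: the Source class below is a stand-in rather than buildstream's actual class, the contents of each track key are my assumption, and TrackIndex is a hypothetical name for the mutable table a mirror would need alongside the CAS. The key deliberately excludes the ref, since the ref is the result of tracking rather than an input to it.

```python
from abc import ABC, abstractmethod


class Source(ABC):
    """Stand-in for a source plugin base class (not buildstream's own)."""

    @abstractmethod
    def get_track_key(self):
        """Return a hashable value identifying what tracking follows."""


class GitSource(Source):
    def __init__(self, url, track_branch, ref=None):
        self.url = url
        self.track_branch = track_branch
        self.ref = ref

    def get_track_key(self):
        # Assumed: tracking a git source depends on the URL and branch only.
        return ("git", self.url, self.track_branch)


class TarSource(Source):
    def __init__(self, url, ref=None):
        self.url = url
        self.ref = ref

    def get_track_key(self):
        # Assumed: tracking a tarball depends on the URL only.
        return ("tar", self.url)


class TrackIndex:
    """Mutable map from track key to the latest ref seen upstream.

    Unlike the content store, entries here are overwritten over time,
    which is why a single tracking agent should own the updates.
    """

    def __init__(self):
        self._latest = {}

    def update(self, source, new_ref):
        self._latest[source.get_track_key()] = new_ref

    def lookup(self, source):
        return self._latest.get(source.get_track_key())
```

Two sources that differ only in their current ref share a track key, so a client holding an old ref can ask the mirror what the branch currently points at.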
The advantages of using CAS as a mirror are:
- Write less code - It will be simpler to maintain a CAS cache that's suitable for mirroring than simultaneously maintaining a file-based cache for mirroring and a CAS-based cache for remote execution and virtual filesystems.
- Less complexity in handling generic mirrors - The CAS cache removes the need to support incremental namespacing.