Should we implement a (remote) CAS-based SourceCache?

changed the description

5. Remote Execution

As a user with access to a remote SourceCache, I want to fetch only what is strictly necessary to perform remote builds.

Background: for remote execution we only require the Directory to perform virtual staging, so we can avoid transferring the actual source files from the remote SourceCache.

I think that with CAS in play, which we intend to rely on for all artifact caches, we might strongly consider coupling the artifact cache with the source cache, or leave that decision to the artifact cache itself (at install / configure time).

The use cases for sharing sources and for sharing artifacts are similar enough, and coupling them would simplify user configuration.

Whether you really want to use a possibly existing cache server for source fetches is another question. Regular upstream fetches can sometimes be quicker, since a lot of the upload is spread across many servers; with a SourceCache in place and given priority in regular use, all users will by default download all their sources from the same server (a lot of the time).

This means that you may have one central bottleneck for all of your builds. Maybe this can be mitigated if we investigate having the client interact with multiple artifact caches (artifact mirrors of a sort)?

I see you've listed use cases above which include different ways for BuildStream to behave. Is the user expected to make the configuration decision on a per-project basis? Or does it make sense to have the project declare a behavior?

Additional Notes

Technically, Elements must now maintain a Source Cache Key, which encapsulates all details about how a source is staged and what the source is. This part of the element's cache key basically now must function independently, and this key is what is used to store and fetch staged build directories from the SourceCache.

The files stored under the element's source cache key must be the collection of all the element's sources; this should not and cannot be done by addressing single staged Sources. This is because the result of staging something can depend on what was already there (e.g. patch sources, or the upcoming SourceTransform plugin variety).

I should note that your default use case example is false, or I fail to understand it:

if the source is not present in the local cache

i) Fetch from remote SourceCache

ii) Store into the local cache

This looks like you intend to download a git checkout of a specific commit sha from a remote SourceCache, and then use that to create a git repository in the local source cache, at ~/.cache/buildstream/sources/git/...; I presume you must have meant something else.

Question: Do the default local use cases need to involve using a CAS SourceCache at all?

Seems to me that I don't mind spending a bit more time staging directly from a tarball or a git repo, if it means I don't have to keep an additional cache of recently used checkouts lying around on my disk - then again, if it really does save me time, and the quota is configurable, I might reconsider.

Whether you really want to use a possibly existing cache server for source fetches is another question. Regular upstream fetches can sometimes be quicker, since a lot of the upload is spread across many servers; with a SourceCache in place and given priority in regular use, all users will by default download all their sources from the same server (a lot of the time).

That distribution may already not be a factor if everything is coming from a single VCS location anyway. The same would hold true in the case of bst mirror. Those would need to scale in a similar fashion.

This means that you may have one central bottleneck for all of your builds. Maybe this can be mitigated if we investigate having the client interact with multiple artifact caches (artifact mirrors of a sort)?

That is a problem that can be tackled on a different level, and is more a question of how to scale an artifact cache. You could use anycast, enabling routing to the nearest artifact cache. You could play DNS tricks. You could use load balancing in front of the artifact cache. I'm expecting that when you need this type of scalability, you're already looking at BuildGrid.

I see you've listed use cases above which include different ways for BuildStream to behave. Is the user expected to make the configuration decision on a per-project basis? Or does it make sense to have the project declare a behavior?

I would expect this at the project level.

Additional Notes

Technically, Elements must now maintain a Source Cache Key, which encapsulates all details about how a source is staged and what the source is. This part of the element's cache key basically now must function independently, and this key is what is used to store and fetch staged build directories from the SourceCache.

Correct. I imagine this key is a subset of our weak artifact key; the aggregate of source identities.

The files stored under the element's source cache key must be the collection of all the element's sources; this should not and cannot be done by addressing single staged Sources. This is because the result of staging something can depend on what was already there (e.g. patch sources, or the upcoming SourceTransform plugin variety).

That is my understanding as well. The Source Cache Key is for the entire element.
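A minimal sketch of what such an element-wide key could look like, assuming each source can report a JSON-serialisable identity. The function name `source_cache_key` and the dictionary fields are hypothetical, not BuildStream's actual API. The point it illustrates is that the key covers all sources in staging order, so reordering a patch and a tarball gives a different key.

```python
import hashlib
import json


def source_cache_key(source_identities):
    """Derive one key from the identities of all of an element's sources.

    `source_identities` is a hypothetical list with one JSON-serialisable
    entry per source, given in staging order.
    """
    # sort_keys only normalises the field order inside each identity.
    # The order of the list itself is kept, because staging a patch
    # after a tarball is not the same as staging them the other way round.
    payload = json.dumps(source_identities, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


# Illustrative identities only; real sources would supply their own.
tarball = {"kind": "tar", "url": "https://example.com/foo.tar.gz", "ref": "abc123"}
patch = {"kind": "patch", "path": "fix-build.patch"}

key_a = source_cache_key([tarball, patch])
key_b = source_cache_key([patch, tarball])
key_c = source_cache_key([tarball, patch])
```

Here `key_a` and `key_c` are identical, while `key_b` differs because the staging order changed.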

I should note that your default use case example is false, or I fail to understand it:

if the source is not present in the local cache

i) Fetch from remote SourceCache

ii) Store into the local cache

This looks like you intend to download a git checkout of a specific commit sha from a remote SourceCache, and then use that to create a git repository in the local source cache, at ~/.cache/buildstream/sources/git/...; I presume you must have meant something else.

I imagine that this is what happens under the hood for i):

calculate the Source Cache Key
contact the SourceCache service and resolve the Source Cache Key to a key in CAS, representing the root directory of the element source
recursively fetch the directory and file nodes, not already present in local CAS, from remote CAS and store them in local CAS
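The three steps above can be sketched roughly as follows. This is a toy model under stated assumptions: both CAS stores are plain dictionaries mapping a SHA-256 digest to bytes, directory nodes are JSON blobs rather than the protobuf Directory messages a real CAS uses, and the key-to-root mapping is a dictionary. All function names here are hypothetical.

```python
import hashlib
import json


def put_blob(cas, blob):
    """Store bytes in a dict-based CAS and return their digest."""
    digest = hashlib.sha256(blob).hexdigest()
    cas[digest] = blob
    return digest


def put_directory(cas, files, dirs):
    """Store a directory node that maps names to file and subdirectory digests."""
    node = json.dumps({"files": files, "dirs": dirs}, sort_keys=True)
    return put_blob(cas, node.encode("utf-8"))


def fetch_directory(remote_cas, local_cas, digest):
    """Copy a directory tree into local_cas, skipping blobs already present.

    Returns the number of blobs that had to be transferred.
    """
    transferred = 0
    if digest not in local_cas:
        local_cas[digest] = remote_cas[digest]
        transferred += 1
    node = json.loads(local_cas[digest])
    for file_digest in node["files"].values():
        if file_digest not in local_cas:
            local_cas[file_digest] = remote_cas[file_digest]
            transferred += 1
    for subdir_digest in node["dirs"].values():
        transferred += fetch_directory(remote_cas, local_cas, subdir_digest)
    return transferred


def fetch_sources(source_refs, remote_cas, local_cas, key):
    """Resolve a source cache key to a root digest and fetch that tree.

    Returns the root digest, or None if the remote does not know the key.
    """
    root = source_refs.get(key)
    if root is None:
        return None
    fetch_directory(remote_cas, local_cas, root)
    return root


# Build a small tree on the "remote" side.
remote, local = {}, {}
readme = put_blob(remote, b"hello\n")
main_c = put_blob(remote, b"int main(void) { return 0; }\n")
src = put_directory(remote, {"main.c": main_c}, {})
root = put_directory(remote, {"README": readme}, {"src": src})
refs = {"example-source-key": root}

local[readme] = remote[readme]  # pretend README is already in local CAS
fetched_root = fetch_sources(refs, remote, local, "example-source-key")
```

Note that nothing in this flow touches ~/.cache/buildstream/sources; only the local CAS is populated.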

~/.cache/buildstream/sources/git/... remains untouched. I imagine that you would only see this populated for sources that are part of elements that you are tracking or for which you are using workspaces.

Question: Do the default local use cases need to involve using a CAS SourceCache at all?

Quoting the Proposal for Remote Execution (emphasis mine):

As a third phase I'm proposing to introduce a new FUSE layer that allows safe access to a directory tree stored in CAS without having to extract it to a regular file system. This is expected to reduce staging time with local execution.

That doesn't necessarily answer whether this is needed.

Seems to me that I don't mind spending a bit more time staging directly from a tarball or a git repo, if it means I don't have to keep an additional cache of recently used checkouts lying around on my disk - then again, if it really does save me time, and the quota is configurable, I might reconsider.

I would argue that you don't have those tarballs or git repos around at all for the majority of elements.

Isn't this type of topic wide and deep enough to go to the mailing list instead of the ticketing system?

changed the description

@tristanvb :

is it expected to have the user make the configuration decision on a per project basis? Or, does it make sense to have the project declare a behavior?

Yes, I think it's essential that we configure this on a per project basis, which I believe can be specified in either the ~/.config/buildstream.conf file or the project.conf of the project itself. Just like we do with artifact servers here and here, respectively.

However, how and where we specify which combination of caches/upstream to use is still up for discussion. My preference would be to specify the configuration within the project.conf file, as it makes sense to keep "per project" config options within the project itself.

@tristanvb, regarding your other questions, I have nothing more to add to @sstriker's response.

changed milestone to %BuildStream_v1.4

Summary from the gathering:

SourceCache alone doesn't solve the mirroring use case, however, it's still useful to avoid fetches using CAS. And sources are anyway needed in CAS for remote execution.
The fetch job will attempt to fetch sources from the remote SourceCache and store them in the local CAS. If they are not available there, it will fetch them from the original source repository and also store them in the local CAS.
In a first step sources will always be fetched for elements that are scheduled for build, even if they are built remotely. This will be optimized in a second step.
~/.cache/buildstream/sources will still be used.
Use ReferenceStorage to map source keys to CAS directory trees.
Sources fetched from the original source repository are pushed to the remote SourceCache, if the user has push access.
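The fetch and push behaviour summarised above could be sketched like this. It is only an illustration: the local CAS and the remote SourceCache are modelled as dictionaries keyed by source key, `fetch_upstream` and `can_push` are hypothetical parameters, and the initial local lookup is an assumption taken from the default use case rather than from this summary.

```python
def fetch_job(key, local_cas, remote_cache, fetch_upstream, can_push):
    """Return where the sources came from: 'local', 'remote' or 'upstream'."""
    # Assumed first step: nothing to do if the sources are already local.
    if key in local_cas:
        return "local"

    # Try the remote SourceCache and store a hit in the local CAS.
    tree = remote_cache.get(key)
    if tree is not None:
        local_cas[key] = tree
        return "remote"

    # Fall back to the original source repository.
    tree = fetch_upstream(key)
    local_cas[key] = tree

    # Share the result only when the user has push access.
    if can_push:
        remote_cache[key] = tree
    return "upstream"


remote = {}
upstream_calls = []


def fake_upstream(key):
    """Stand-in for a real upstream fetch; records each call."""
    upstream_calls.append(key)
    return "tree-for-" + key


# A user with push access fetches first and populates the remote.
first = fetch_job("k1", {}, remote, fake_upstream, can_push=True)

# A second user without push access now gets a remote hit, then a local hit.
other_local = {}
second = fetch_job("k1", other_local, remote, fake_upstream, can_push=False)
third = fetch_job("k1", other_local, remote, fake_upstream, can_push=False)
```

In this run the upstream repository is contacted exactly once, by the first user.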

Sources fetched from the original source repository are pushed to the remote SourceCache, if the user has push access.

Will bst fetch automatically do this? I.e. bst fetch foo.bst, will:

Try to fetch the source from CAS; if it is not there,
Fetch the source from upstream
Push to the SourceCache as well as storing it locally

Been having a look at how we might do this today and I think it might be sensible to do some refactoring of the CASCache and move parts of the functionality into CASRemote. There is quite a lot that involves passing a remote to CASCache methods, and using stubs there. We want to reuse a lot of this functionality, but we don't want to write anything to a local CAS, as we are gonna be using the current source cache set up.

We also might want to either move those parts or rename the _artifactcache directory, as they no longer necessarily have to do with the artifact cache. That should make solving this issue easier as we can use CASRemote without reimplementing a bunch of code.

Any thoughts?

Moving communication with the remote server to CASRemote sounds sensible to me. Make sure to coordinate with @finnball as he might be doing something similar for #801 (closed).

Also keep in mind that this area might change longer term if we decide to use buildbox-casd in the future in BuildStream as well. See https://lists.buildgrid.build/pipermail/buildgrid/2018-November/000069.html

We also might want to either move those parts or rename the _artifactcache directory, as they no longer necessarily have to do with the artifact cache. That should make solving this issue easier as we can use CASRemote without reimplementing a bunch of code.

Yes, it probably makes sense to move things around a bit. We could add a _cas directory, but then the question would be whether the _artifactcache directory still makes sense. Side-note: If we decide to move/rename files, we should always use separate commits for this without any code changes (except necessary import changes).

Also keep in mind that this area might change longer term if we decide to use buildbox-casd in the future in BuildStream as well. See https://lists.buildgrid.build/pipermail/buildgrid/2018-November/000069.html

Would that affect how we deal with a remote CAS on the BuildStream side?

Yes, it probably makes sense to move things around a bit. We could add a _cas directory, but then the question would be whether the _artifactcache directory still makes sense. Side-note: If we decide to move/rename files, we should always use separate commits for this without any code changes (except necessary import changes).

I'd go with no, and just have an artifactcache.py file in the _cas directory.

I might create a separate issue to deal with this refactor, so we can get part of this in first, unless anyone's against that.

mentioned in issue #802 (closed)

marked this issue as related to #802 (closed)

mentioned in merge request !1013 (closed)
