Skip to content

fetch() is called on the source even though it is cached

Summary

I have an element and I have some custom source plugin I've developed, and their fetch() operation is costly.

I noticed that, if one source in the element changes, FetchQueue asks all sources to be fetched again including the ones that reported themselves Consistency.CACHED.

  • From user perspective, I don't think this should happen, it is a waste of time.
  • From plugin developer perspective, I don't want to include handle repeated calls to fetch() logic in my plugin's fetch(), instead I want to keep it only in get_consistency().

Additionally, for example, we are hitting a line we should never hit in DownloadableFileSource-based plugins: https://gitlab.com/BuildStream/buildstream/blob/a53d6d1f81913d639175cbd503d9d71ec7133a4e/buildstream/plugins/sources/_downloadablefilesource.py#L132

I was also able to observe this in git.py plugin. git and tar for example can handle this at the moment but this is still wrong.

Steps to reproduce

I used a patch like this to add some print statements - you can add your own..

diff --git a/buildstream/plugins/sources/_downloadablefilesource.py b/buildstream/plugins/sources/_downloadablefilesource.py
index f5c5b3d0..80099787 100644
--- a/buildstream/plugins/sources/_downloadablefilesource.py
+++ b/buildstream/plugins/sources/_downloadablefilesource.py
@@ -132,6 +132,7 @@ class DownloadableFileSource(Source):
         # not be called if the source is already Consistency.CACHED.
         #
         if os.path.isfile(self._get_mirror_file()):
+            self.info("defensive check hit?? in {}".format(self.name))
             return  # pragma: nocover

         # Download the file, raise hell if the sha256sums don't match,
@@ -162,6 +163,7 @@ class DownloadableFileSource(Source):
             etagfile.write(etag)

     def _ensure_mirror(self):
+        self.info("_ensure_mirror in {}".format(self.name))
         # Downloads from the url and caches it according to its sha256sum.
         try:
             with self.tempdir() as td:
diff --git a/buildstream/source.py b/buildstream/source.py
index d6681b96..f6e3c07a 100644
--- a/buildstream/source.py
+++ b/buildstream/source.py
@@ -973,6 +973,7 @@ class Source(Plugin):

     # Tries to call fetch for every mirror, stopping once it succeeds
     def __do_fetch(self, **kwargs):
+        self.info("__do_fetch called for {}".format(self.name))
         project = self._get_project()
         context = self._get_context()

@@ -1028,6 +1029,7 @@ class Source(Plugin):
             else:
                 mirrors = project.config.mirrors
             if not mirrors or not alias:
+                self.info("{} calling fetch()!".format(self.name))
                 self.fetch(**kwargs)
                 return

Prepare a simple bst file with one tar source. bst fetch it. Later add one more tar source. Tar is smart and its fetch() terminates early, but still a problem.

Do the same for git sources.

What is the current bug behavior?

One non-cached source is enough to trigger fetch() on all sources.

What is the expected correct behavior?

Buildstream should just fetch the sources that are not already cached.

Relevant logs and/or screenshots

Possible fixes

Here is the line we need to add a check, I believe. We might need to discuss if it'd be better to keep that logic in Element instead because there would be a source.get_consistency() == Consistency.CACHED check - too many details for a Queue perhaps.

Other relevant information

  • BuildStream version affected: 1.3.0+1017.g761e7570.

To upload designs, you'll need to enable LFS and have an admin enable hashed storage. More information