Queues do too much work during scheduling
Description
James and I noted that his build was very slow: bst was consuming 100% of a CPU as the parent process while the child processes were mostly asleep, and there was a perceptible pause of perhaps 750 to 1000 milliseconds every time a job completed. We straced the parent and noted that it was stat()ing refs in the cache a LOT, so we decided to investigate deeper.
By examination of the source code, anything calling Element._update_state() can cause up to two pokes at the artifact cache. Mostly that happens when the element actually changes state (e.g. pull complete, build complete, etc.), but it is also called whenever the job queues process things (or at least most of them do).
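As a rough illustration of where those two cache pokes can come from, here is a simplified sketch, assuming that both a weak and a strict cache key get checked; the attribute names are modelled on the traceback below, not copied from the real implementation:

```python
# Simplified sketch only -- not the real Element._update_state().
# Assumption: a weak and a strict cache key may each be checked, so a
# single _update_state() call can stat() up to two refs in the CAS.
def _update_state(self):
    if not self.__weak_cached:
        # first potential CASCache.contains() -> os.path.exists(refpath)
        self.__weak_cached = self.__artifacts.contains(self, self.__weak_cache_key)
    if not self.__strong_cached:
        # second potential call, the one visible at element.py:1137 below
        self.__strong_cached = self.__artifacts.contains(self, self.__strict_cache_key)
```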
Also by examination: every time an element job of any kind completes, all the waiting elements in all the queues end up being examined for their status, which results in a call to Element._update_state() for each of them and thus a number of stat() calls.
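The shape of that pathway, reduced to a sketch (the method names are taken from the traceback below; the loop over waiting elements is paraphrased, not the literal code):

```python
# Sketch of the rescan that happens on every job completion: each queue
# re-examines every element it is still holding, and each status check may
# end up calling CASCache.contains() (and hence stat()) via _update_state().
def job_completed(scheduler, job, success):
    scheduler._schedule_queue_jobs()

def _schedule_queue_jobs(scheduler):
    for queue in reversed(scheduler.queues):
        queue.pop_ready_jobs()

def pop_ready_jobs(queue):
    for element in queue.waiting_elements():      # every waiting element...
        status = queue.status(element)            # ...calls element._update_state()
```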
I added support to log a traceback from every call to CASCache.contains(), along with the element name and whether the result was present or absent. Each traceback went to a new file named by the pid and a monotonic counter. This allowed us to easily count the number of times CASCache.contains() was called, for which elements/cache-keys, and through which pathway in the code.
A very brief analysis was then undertaken which determined that the vast majority of calls to CASCache.contains() come via Queue.pop_ready_jobs():
File "/home/danielsilverstone/buildstream/buildstream/_scheduler/jobs/job.py", line 583, in _parent_child_completed
self._scheduler.job_completed(self, success)
File "/home/danielsilverstone/buildstream/buildstream/_scheduler/scheduler.py", line 240, in job_completed
self._schedule_queue_jobs()
File "/home/danielsilverstone/buildstream/buildstream/_scheduler/scheduler.py", line 326, in _schedule_queue_jobs
queue.pop_ready_jobs() for queue in reversed(self.queues)
File "/home/danielsilverstone/buildstream/buildstream/_scheduler/scheduler.py", line 326, in <genexpr>
queue.pop_ready_jobs() for queue in reversed(self.queues)
File "/home/danielsilverstone/buildstream/buildstream/_scheduler/queues/queue.py", line 228, in pop_ready_jobs
status = self.status(element)
File "/home/danielsilverstone/buildstream/buildstream/_scheduler/queues/buildqueue.py", line 75, in status
element._update_state()
File "/home/danielsilverstone/buildstream/buildstream/element.py", line 1137, in _update_state
self.__strong_cached = self.__artifacts.contains(self, self.__strict_cache_key)
File "/home/danielsilverstone/buildstream/buildstream/_artifactcache/cascache.py", line 99, in contains
traceback.print_stack(file=fh)
The test undertaken
The Buildbox project had 78 elements to build. The total number of times that we looked for an element ref in the cache was 13853, and the most common pathway was related to the buildbox.bst element, coming via BuildQueue.status() from Queue.pop_ready_jobs(); that pathway was hit 233 times (note: 78 * 3 == 234). The number of times that a traceback included that function was 12955.
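For a sense of scale, a back-of-envelope estimate (the queue count and drain pattern here are assumptions for illustration, not further measurements) shows why the call count grows roughly quadratically with the number of elements:

```python
# Back-of-envelope estimate (assumptions, not measured internals): if each
# job completion rescans the elements still waiting in every queue, the
# number of status checks grows roughly quadratically with element count.
N = 78   # elements in the Buildbox test project
Q = 3    # assumed number of queues in the pipeline (e.g. fetch/build/push)

# Each queue drains N elements; the i-th completion leaves roughly N - i
# elements waiting, so the rescans per queue sum to about N * (N - 1) / 2.
rescans = Q * N * (N - 1) // 2
print(rescans)   # 9009 -- the same order of magnitude as the 13853 observed
                 # CASCache.contains() calls (each check can hit the cache twice)
```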
In addition, we revisited James' situation (roughly 21000 elements, all gathered into a single stack) and considered mechanisms to improve matters. By writing a crude wrapper script to read the stack and then issue bst build for only 100 goal elements at a time, the build queue was limited to a maximum of around 1200 elements (worst observed) and no perceptible pause was present when jobs completed.
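The wrapper was along these lines (a minimal sketch, assuming the goal element names are available one per line in a plain text file; the real script read them out of the stack element, and the file name and chunking details here are illustrative):

```python
#!/usr/bin/env python3
# Hypothetical sketch: build a very large list of elements in chunks of 100
# so that BuildStream's queues never hold the full element graph at once.
import subprocess
import sys

CHUNK = 100

def main(listfile):
    with open(listfile) as f:
        elements = [line.strip() for line in f if line.strip()]
    for i in range(0, len(elements), CHUNK):
        chunk = elements[i:i + CHUNK]
        # bst build accepts multiple targets on the command line
        subprocess.run(["bst", "build"] + chunk, check=True)

if __name__ == "__main__":
    main(sys.argv[1])
```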
Summary
- It is very hard to cache within the main process whether or not a previously uncached element is now cached in the CAS.
- Despite building a topological map of the pipeline, the main process does not have any mechanism to use that data to reduce the work done whenever a job completes.
- It would be valuable to have a limited-propagation event system which allows the scheduler to inject changes in the state graph into the elements, having them propagate state changes only as far as they are relevant (a rough sketch follows this list).
- Until this is done, BuildStream is crippled in situations where there are large numbers of elements present in the queues, or where the cache is on a high-latency filesystem (such as NFS), and where jobs complete quickly.
- The above situation is not going to be rare, especially once remote execution allows tens or hundreds of workers to run simultaneously and large fetch queues are able to drain quickly thanks to small inputs and fast network connectivity.
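To illustrate the third point: rather than rescanning every waiting element, the scheduler would notify only the completed element, and the change would ripple out no further than the reverse dependencies whose state can actually be affected. None of the names below exist in BuildStream; this is a hypothetical design sketch, not a concrete API proposal.

```python
# Hypothetical design sketch (these names do not exist in BuildStream).
# The scheduler injects a single event; propagation stops as soon as an
# element's state is unaffected, so most of the graph is never touched.
class StateGraph:
    def __init__(self):
        self.reverse_deps = {}   # element -> elements that depend on it

    def element_cached(self, element):
        self._propagate(element)

    def _propagate(self, element):
        for rdep in self.reverse_deps.get(element, ()):
            changed = rdep.update_state_from_dependency(element)
            if changed:
                # Only keep walking the graph while state actually changes
                self._propagate(rdep)
```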
Steps to reproduce
Add this patch:
```diff
diff --git a/buildstream/_artifactcache/cascache.py b/buildstream/_artifactcache/cascache.py
index 3e63608b..db47ad3d 100644
--- a/buildstream/_artifactcache/cascache.py
+++ b/buildstream/_artifactcache/cascache.py
@@ -48,6 +48,7 @@ from . import ArtifactCache
 # Limit payload to 1 MiB to leave sufficient headroom for metadata.
 _MAX_PAYLOAD_BYTES = 1024 * 1024
 
+counter = 1
 
 # A CASCache manages artifacts in a CAS repository as specified in the
 # Remote Execution API.
@@ -89,8 +90,17 @@ class CASCache(ArtifactCache):
     def contains(self, element, key):
         refpath = self._refpath(self.get_artifact_fullname(element, key))
 
+        import traceback
+        from os import getpid
+        ret = os.path.exists(refpath)
+        global counter
+        with open("/tmp/traces/{}-{}".format(getpid(), counter), "a") as fh:
+            fh.write("!!! {} {} {}\n".format(element.name, key, ret))
+            traceback.print_stack(file=fh)
+        counter = counter + 1
+
         # This assumes that the repository doesn't have any dangling pointers
-        return os.path.exists(refpath)
+        return ret
 
     def extract(self, element, key):
         ref = self.get_artifact_fullname(element, key)
```
Then mkdir /tmp/traces and run bst build on any moderately sized project. You can then look at all the tracebacks in /tmp/traces.
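To summarise the traces, something along the following lines can be used (a hypothetical helper, not part of the patch; it assumes the "!!! element key result" first-line format written by the instrumentation above):

```python
#!/usr/bin/env python3
# Hypothetical helper to summarise the files written to /tmp/traces by the
# patch above: counts contains() calls per element and per calling function.
import os
from collections import Counter

TRACE_DIR = "/tmp/traces"

per_element = Counter()
per_function = Counter()

for name in os.listdir(TRACE_DIR):
    with open(os.path.join(TRACE_DIR, name)) as fh:
        lines = fh.read().splitlines()
    # First line is "!!! <element> <key> <True/False>"
    per_element[lines[0].split()[1]] += 1
    # Count every function named in the traceback's 'File ..., in <func>' lines
    for line in lines:
        line = line.strip()
        if line.startswith("File ") and ", in " in line:
            per_function[line.rsplit(", in ", 1)[1]] += 1

print(per_element.most_common(10))
print(per_function.most_common(10))
```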
Other relevant information
- BuildStream version affected: All