Queues do too much work during scheduling
Description
James and I noted that his build was very slow: bst was consuming 100% of a CPU as the parent process while the child processes were mostly asleep, and there was a perceptible pause of perhaps 750 to 1000 milliseconds every time a job completed. We straced the parent and noted that it was stat()ing refs in the cache a LOT, so we decided to investigate deeper.
By examination of the source code, anything calling Element._update_state() can cause up to two pokes at the artifact cache. Mostly that happens when the element actually changes state (e.g. pull complete, build complete, etc.), but it is also called whenever the job queues process things (or at least most of them do).
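As a rough illustration of where those two cache pokes can come from, here is a simplified sketch, assuming that both a weak and a strict cache key get checked; the attribute names are modelled on the traceback below, not copied from the real implementation:

```python
# Simplified sketch only -- not the real Element._update_state().
# Assumption: a weak and a strict cache key may each be checked, so a
# single _update_state() call can stat() up to two refs in the CAS.
def _update_state(self):
    if not self.__weak_cached:
        # first potential CASCache.contains() -> os.path.exists(refpath)
        self.__weak_cached = self.__artifacts.contains(self, self.__weak_cache_key)
    if not self.__strong_cached:
        # second potential call, the one visible at element.py:1137 below
        self.__strong_cached = self.__artifacts.contains(self, self.__strict_cache_key)
```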
Also by examination: every time an element job of any kind completes, all the waiting elements in all the queues end up being examined for their status, which results in a call to Element._update_state() for each of them and thus a number of stat() calls.
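The shape of that pathway, reduced to a sketch (the method names are taken from the traceback below; the loop over waiting elements is paraphrased, not the literal code):

```python
# Sketch of the rescan that happens on every job completion: each queue
# re-examines every element it is still holding, and each status check may
# end up calling CASCache.contains() (and hence stat()) via _update_state().
def job_completed(scheduler, job, success):
    scheduler._schedule_queue_jobs()

def _schedule_queue_jobs(scheduler):
    for queue in reversed(scheduler.queues):
        queue.pop_ready_jobs()

def pop_ready_jobs(queue):
    for element in queue.waiting_elements():      # every waiting element...
        status = queue.status(element)            # ...calls element._update_state()
```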
I added support to log a traceback from every call to CASCache.contains(), along with the element name and whether the result was present or absent. Each traceback went to a new file named by the pid and a monotonic counter. This allowed us to easily count the number of times CASCache.contains() was called, for which elements/cache-keys, and through which pathway in the code.
A very brief analysis was then undertaken which determined that the vast majority of calls to CASCache.contains() come via Queue.pop_ready_jobs():
File "/home/danielsilverstone/buildstream/buildstream/_scheduler/jobs/job.py", line 583, in _parent_child_completed
self._scheduler.job_completed(self, success)
File "/home/danielsilverstone/buildstream/buildstream/_scheduler/scheduler.py", line 240, in job_completed
self._schedule_queue_jobs()
File "/home/danielsilverstone/buildstream/buildstream/_scheduler/scheduler.py", line 326, in _schedule_queue_jobs
queue.pop_ready_jobs() for queue in reversed(self.queues)
File "/home/danielsilverstone/buildstream/buildstream/_scheduler/scheduler.py", line 326, in <genexpr>
queue.pop_ready_jobs() for queue in reversed(self.queues)
File "/home/danielsilverstone/buildstream/buildstream/_scheduler/queues/queue.py", line 228, in pop_ready_jobs
status = self.status(element)
File "/home/danielsilverstone/buildstream/buildstream/_scheduler/queues/buildqueue.py", line 75, in status
element._update_state()
File "/home/danielsilverstone/buildstream/buildstream/element.py", line 1137, in _update_state
self.__strong_cached = self.__artifacts.contains(self, self.__strict_cache_key)
File "/home/danielsilverstone/buildstream/buildstream/_artifactcache/cascache.py", line 99, in contains
traceback.print_stack(file=fh)
The test undertaken
The Buildbox project had 78 elements to build. The total number of times that we looked for an element ref in the cache was 13853, and the most common pathway was related to the buildbox.bst element, coming via BuildQueue.status() from Queue.pop_ready_jobs(); that pathway was hit 233 times (note: 78 * 3 == 234). The number of times that a traceback included that function was 12955.
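For a sense of scale, a back-of-envelope estimate (the queue count and drain pattern here are assumptions for illustration, not further measurements) shows why the call count grows roughly quadratically with the number of elements:

```python
# Back-of-envelope estimate (assumptions, not measured internals): if each
# job completion rescans the elements still waiting in every queue, the
# number of status checks grows roughly quadratically with element count.
N = 78   # elements in the Buildbox test project
Q = 3    # assumed number of queues in the pipeline (e.g. fetch/build/push)

# Each queue drains N elements; the i-th completion leaves roughly N - i
# elements waiting, so the rescans per queue sum to about N * (N - 1) / 2.
rescans = Q * N * (N - 1) // 2
print(rescans)   # 9009 -- the same order of magnitude as the 13853 observed
                 # CASCache.contains() calls (each check can hit the cache twice)
```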
In addition, we revisited James' situation (roughly 21000 elements, all gathered into a single stack) and considered mechanisms to improve matters. By writing a crude wrapper script to read the stack and then issue bst build for only 100 goal elements at a time, the build queue was limited to a maximum of around 1200 elements (worst observed) and no perceptible pause was present when jobs completed.
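The wrapper was along these lines (a minimal sketch, assuming the goal element names are available one per line in a plain text file; the real script read them out of the stack element, and the file name and chunking details here are illustrative):

```python
#!/usr/bin/env python3
# Hypothetical sketch: build a very large list of elements in chunks of 100
# so that BuildStream's queues never hold the full element graph at once.
import subprocess
import sys

CHUNK = 100

def main(listfile):
    with open(listfile) as f:
        elements = [line.strip() for line in f if line.strip()]
    for i in range(0, len(elements), CHUNK):
        chunk = elements[i:i + CHUNK]
        # bst build accepts multiple targets on the command line
        subprocess.run(["bst", "build"] + chunk, check=True)

if __name__ == "__main__":
    main(sys.argv[1])
```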
Summary
- It is very hard to cache within the main process whether or not a previously uncached element is now cached in the CAS.
- Despite building a topological map of the pipeline, the main process does not have any mechanism to use that data to reduce the work done whenever a job completes.
- It would be valuable to have a limited-propagation event system which allows the scheduler to inject changes in the state graph into the elements, having them propagate state changes only as far as they are relevant (a rough sketch follows this list).
- Until this is done, BuildStream is crippled in situations where there are large numbers of elements present in the queues, or where the cache is on a high-latency filesystem (such as NFS), and where jobs complete quickly.
- The above situation is not going to be rare, especially once remote execution allows tens or hundreds of workers to run simultaneously and large fetch queues are able to drain quickly thanks to small inputs and fast network connectivity.
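To illustrate the third point: rather than rescanning every waiting element, the scheduler would notify only the completed element, and the change would ripple out no further than the reverse dependencies whose state can actually be affected. None of the names below exist in BuildStream; this is a hypothetical design sketch, not a concrete API proposal.

```python
# Hypothetical design sketch (these names do not exist in BuildStream).
# The scheduler injects a single event; propagation stops as soon as an
# element's state is unaffected, so most of the graph is never touched.
class StateGraph:
    def __init__(self):
        self.reverse_deps = {}   # element -> elements that depend on it

    def element_cached(self, element):
        self._propagate(element)

    def _propagate(self, element):
        for rdep in self.reverse_deps.get(element, ()):
            changed = rdep.update_state_from_dependency(element)
            if changed:
                # Only keep walking the graph while state actually changes
                self._propagate(rdep)
```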
Steps to reproduce
Add this patch:
```diff
diff --git a/buildstream/_artifactcache/cascache.py b/buildstream/_artifactcache/cascache.py
index 3e63608b..db47ad3d 100644
--- a/buildstream/_artifactcache/cascache.py
+++ b/buildstream/_artifactcache/cascache.py
@@ -48,6 +48,7 @@ from . import ArtifactCache
 # Limit payload to 1 MiB to leave sufficient headroom for metadata.
 _MAX_PAYLOAD_BYTES = 1024 * 1024
 
+counter = 1
 
 # A CASCache manages artifacts in a CAS repository as specified in the
 # Remote Execution API.
@@ -89,8 +90,17 @@ class CASCache(ArtifactCache):
     def contains(self, element, key):
         refpath = self._refpath(self.get_artifact_fullname(element, key))
 
+        import traceback
+        from os import getpid
+        ret = os.path.exists(refpath)
+        global counter
+        with open("/tmp/traces/{}-{}".format(getpid(), counter), "a") as fh:
+            fh.write("!!! {} {} {}\n".format(element.name, key, ret))
+            traceback.print_stack(file=fh)
+        counter = counter + 1
+
         # This assumes that the repository doesn't have any dangling pointers
-        return os.path.exists(refpath)
+        return ret
 
     def extract(self, element, key):
         ref = self.get_artifact_fullname(element, key)
```
Then mkdir /tmp/traces and run bst build on any moderately sized project. You can then look at all the tracebacks in /tmp/traces.
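To summarise the traces, something along the following lines can be used (a hypothetical helper, not part of the patch; it assumes the "!!! element key result" first-line format written by the instrumentation above):

```python
#!/usr/bin/env python3
# Hypothetical helper to summarise the files written to /tmp/traces by the
# patch above: counts contains() calls per element and per calling function.
import os
from collections import Counter

TRACE_DIR = "/tmp/traces"

per_element = Counter()
per_function = Counter()

for name in os.listdir(TRACE_DIR):
    with open(os.path.join(TRACE_DIR, name)) as fh:
        lines = fh.read().splitlines()
    # First line is "!!! <element> <key> <True/False>"
    per_element[lines[0].split()[1]] += 1
    # Count every function named in the traceback's 'File ..., in <func>' lines
    for line in lines:
        line = line.strip()
        if line.startswith("File ") and ", in " in line:
            per_function[line.rsplit(", in ", 1)[1]] += 1

print(per_element.most_common(10))
print(per_function.most_common(10))
```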
Other relevant information
- BuildStream version affected: All