Interrupt handling during a task that involves cascache grpc processes is very flaky
Summary
Manually interrupting a task/job that involves interaction with local cascache's grpc calls (such as a fetch, which in turn leads to source caching) very rarely leaves bst in a state to continue once the user has instructed the interrupt to exit & continue as an unhandled exception is thrown. I've also seen this occur on non source cache operations, such as when element artifacts are being cached.
Steps to reproduce
- Build something that isn't cached (I usually uses /docs/examples/autotools hello.bst for testing)
- ctrl+c out when the UI is showing that fetch jobs are on-going
- submit
c
to continue & release the interrupt
What is the current bug behavior?
[//]: # An unhandled grpc exception is thrown leaving bst in a state which it can't recover from
What is the expected correct behavior?
[//]: # bst handles the exception and can continue operating
Relevant logs and/or screenshots
[--:--:--][96c13701][ fetch:base/alpine.bst ] START autotools/base-alpine/96c13701-fetch.18085.log
[--:--:--][f3b6a558][ fetch:hello.bst ] START autotools/hello/f3b6a558-fetch.18087.log
[--:--:--][96c13701][ fetch:base/alpine.bst ] START Fetching https://bst-integration-test-images.ams3.cdn.digitaloceanspaces.com/integration-tests-base.v1.x86_64.tar.xz
[--:--:--][f3b6a558][ fetch:hello.bst ] START Fetching http://ftpmirror.gnu.org/gnu/automake/automake-1.16.tar.gz
[00:00:02][96c13701][ fetch:base/alpine.bst ] SUCCESS Fetching https://bst-integration-test-images.ams3.cdn.digitaloceanspaces.com/integration-tests-base.v1.x86_64.tar.xz
[--:--:--][96c13701][ main:base/alpine.bst ] STATUS Fetch suspending
[--:--:--][f3b6a558][ main:hello.bst ] STATUS Fetch suspending
User interrupted with ^C
Choose one of the following options:
(c)ontinue - Continue queueing jobs as much as possible
(q)uit - Exit after all ongoing jobs complete
(t)erminate - Terminate any ongoing jobs and exit
Pressing ^C again will terminate jobs and exit
Choice: [continue]: c
Continuing
[--:--:--][96c13701][ main:base/alpine.bst ] STATUS Fetch resuming
[--:--:--][f3b6a558][ main:hello.bst ] STATUS Fetch resuming
[00:00:02][f3b6a558][ fetch:hello.bst ] SUCCESS Fetching http://ftpmirror.gnu.org/gnu/automake/automake-1.16.tar.gz
[00:00:08][f3b6a558][ fetch:hello.bst ] BUG Fetch
An unhandled exception occured:
Traceback (most recent call last):
File "/home/tom/.local/lib/python3.7/site-packages/buildstream/_scheduler/jobs/job.py", line 775, in child_action
result = self.child_process() # pylint: disable=assignment-from-no-return
File "/home/tom/.local/lib/python3.7/site-packages/buildstream/_scheduler/jobs/elementjob.py", line 105, in child_process
return self._action_cb(self._element)
File "/home/tom/.local/lib/python3.7/site-packages/buildstream/_scheduler/queues/fetchqueue.py", line 87, in _fetch_not_original
element._fetch(fetch_original=False)
File "/home/tom/.local/lib/python3.7/site-packages/buildstream/element.py", line 2192, in _fetch
self.__cache_sources()
File "/home/tom/.local/lib/python3.7/site-packages/buildstream/element.py", line 3095, in __cache_sources
self.__sourcecache.commit(source, [])
File "/home/tom/.local/lib/python3.7/site-packages/buildstream/_sourcecache.py", line 154, in commit
vdir.import_files(tmpdir, can_link=True)
File "/home/tom/.local/lib/python3.7/site-packages/buildstream/storage/_casbaseddirectory.py", line 391, in import_files
result=result, can_link=can_link)
File "/home/tom/.local/lib/python3.7/site-packages/buildstream/storage/_casbaseddirectory.py", line 281, in _import_files_from_directory
can_link=can_link)
File "/home/tom/.local/lib/python3.7/site-packages/buildstream/storage/_casbaseddirectory.py", line 295, in _import_files_from_directory
can_link=can_link)
File "/home/tom/.local/lib/python3.7/site-packages/buildstream/storage/_casbaseddirectory.py", line 142, in _add_file
entry.digest = self.cas_cache.add_object(path=path, link_directly=can_link)
File "/home/tom/.local/lib/python3.7/site-packages/buildstream/_cas/cascache.py", line 467, in add_object
local_cas = self._get_local_cas()
File "/home/tom/.local/lib/python3.7/site-packages/buildstream/_cas/cascache.py", line 111, in _get_local_cas
capabilities.GetCapabilities(remote_execution_pb2.GetCapabilitiesRequest())
File "/home/tom/.local/lib/python3.7/site-packages/grpc/_channel.py", line 562, in __call__
return _end_unary_response_blocking(state, call, False, None)
File "/home/tom/.local/lib/python3.7/site-packages/grpc/_channel.py", line 466, in _end_unary_response_blocking
raise _Rendezvous(state, None, None, deadline)
grpc._channel._Rendezvous: <_Rendezvous of RPC that terminated with:
status = StatusCode.UNAVAILABLE
details = "Connect Failed"
debug_error_string = "{"created":"@1567609606.864711857","description":"Failed to create subchannel","file":"src/core/ext/filters/client_channel/client_channel.cc","file_line":2715,"referenced_errors":[{"created":"@1567609606.825517201","description":"failed to connect to all addresses","file":"src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc","file_line":456,"referenced_errors":[{"created":"@1567609606.825507706","description":"Connect Failed","file":"src/core/ext/filters/client_channel/subchannel.cc","file_line":963,"grpc_status":14,"referenced_errors":[{"created":"@1567609606.825475671","description":"Connection refused","errno":111,"file":"src/core/lib/iomgr/tcp_client_posix.cc","file_line":313,"os_error":"Connection refused","syscall":"connect"}]}]}]}"
Other relevant information
[//]: Occuring on bst master, 15ae3438