Race on temporary file in .cache/buildstream/cas/
Summary
When building a large project with the cache volume nearly full, BuildStream crashes while trying to get the size of a temporary file in the cascache directory that has since been removed.
Steps to reproduce
Arrange for a CacheSizeJob to run while a temporary file exists in the cascache directory.
Such a file is created whenever BuildQueue has a finished job: the BuildQueue updates the cache size by adding the finished artifact size. Sometimes the CacheSizeJob notices the temporary file that CASQuota._write_cache_size() uses to atomically update the size file, and tries to stat it after it has been deleted.
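The window described above can be simulated deterministically in a single process. This is a minimal sketch, not BuildStream code: `get_dir_size` only approximates the scan seen in the traceback, the `before_stat` hook is a hypothetical stand-in for the other process, and it assumes a POSIX system where `DirEntry.stat(follow_symlinks=False)` performs a real `lstat()` call.

```python
import os
import tempfile


def get_dir_size(path, before_stat=None):
    """Sum file sizes under path, roughly like the scan in the traceback.

    before_stat is a hypothetical hook, used only to simulate another
    process acting between the directory listing and the stat() call.
    """
    total = 0
    with os.scandir(path) as entries:
        for entry in entries:
            if before_stat is not None:
                before_stat(entry)
            if entry.is_dir(follow_symlinks=False):
                total += get_dir_size(entry.path, before_stat)
            else:
                # Raises FileNotFoundError if the entry vanished after listing.
                total += entry.stat(follow_symlinks=False).st_size
    return total


def simulate_race():
    """Return the name of the file that vanished mid-scan, or None."""
    with tempfile.TemporaryDirectory() as casdir:
        # Stand-in for the temporary file used to update the size file.
        fd, tmp_path = tempfile.mkstemp(dir=casdir)
        os.close(fd)
        tmp_name = os.path.basename(tmp_path)

        def remove_tmp(entry):
            # Simulates the writer finishing: the temporary file disappears
            # after the scanner has listed it but before it is stat()ed.
            if entry.name == tmp_name:
                os.unlink(tmp_path)

        try:
            get_dir_size(casdir, before_stat=remove_tmp)
        except FileNotFoundError as e:
            return os.path.basename(e.filename)
    return None
```

Calling `simulate_race()` should fail on the same kind of `tmp...` file name as in the report, which suggests the crash needs nothing more than an unlucky interleaving of the directory listing and the stat.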
What is the current bug behavior?
As reported by @BenjaminSchubert:
  File "/home/bschubert/.local/lib/python3.6/site-packages/buildstream/_scheduler/jobs/job.py", line 425, in _child_action
    result = self.child_process() # pylint: disable=assignment-from-no-return
  File "/home/bschubert/.local/lib/python3.6/site-packages/buildstream/_scheduler/jobs/cachesizejob.py", line 31, in child_process
    return self._casquota.compute_cache_size()
  File "/home/bschubert/.local/lib/python3.6/site-packages/buildstream/_cas/cascache.py", line 1065, in compute_cache_size
    new_cache_size = self.calculate_cache_size()
  File "/home/bschubert/.local/lib/python3.6/site-packages/buildstream/_cas/cascache.py", line 1080, in calculate_cache_size
    return utils._get_dir_size(self.casdir)
  File "/home/bschubert/.local/lib/python3.6/site-packages/buildstream/utils.py", line 631, in _get_dir_size
    return get_size(path)
  File "/home/bschubert/.local/lib/python3.6/site-packages/buildstream/utils.py", line 624, in get_size
    total += f.stat(follow_symlinks=False).st_size
FileNotFoundError: [Errno 2] No such file or directory: '/home/bschubert/.cache/buildstream-cache/buildstream/cas/tmpfa6k35pq'
What is the expected correct behavior?
No crashing :)
Relevant logs and/or screenshots
Possible fixes
- Use the sibling tmp directory for the temporary file, instead of the cas directory.
We also need to make sure that this isn't a symptom of a larger problem of data races.
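The sibling-directory option could look roughly like the sketch below. It is not the actual CASQuota._write_cache_size() implementation: the `write_cache_size` function, the `basedir` layout and the `cache_size` file name are assumptions made for illustration. It also assumes that `tmp` and `cas` live on the same filesystem, which `os.replace()` needs in order to be atomic.

```python
import os
import tempfile


def write_cache_size(basedir, size):
    """Atomically write size to <basedir>/cas/cache_size (hypothetical name).

    The temporary file is created in <basedir>/tmp, so a concurrent scan of
    <basedir>/cas never lists a file that is about to disappear.
    """
    casdir = os.path.join(basedir, "cas")
    tmpdir = os.path.join(basedir, "tmp")
    os.makedirs(casdir, exist_ok=True)
    os.makedirs(tmpdir, exist_ok=True)

    fd, tmp_path = tempfile.mkstemp(dir=tmpdir)
    try:
        with os.fdopen(fd, "w") as f:
            f.write(str(size))
        # Atomic rename into place: the target path always refers to either
        # the old or the new file, never to a missing one.
        os.replace(tmp_path, os.path.join(casdir, "cache_size"))
    except BaseException:
        # Do not leave a stray temporary file behind on failure.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

With this layout the only file the writer touches inside `cas` is the final size file, and that is swapped in by a rename rather than created and deleted, so a scanner that stats it should always find something there. It does not address any other writers that may create short-lived files under `cas`.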
Other relevant information
- BuildStream version affected: /milestone %BuildStream_v1.x