Currently when staging things we walk the extracted artifact frequently using system calls.
Since we have a system for reading back metadata that we encode at build time, we should be using that instead.
First, the Element needs to offer an API for fetching the manifest; calling that API before the element's artifact is in the local cache should cause an error. This API is implemented by storing a new manifest.yaml in the artifact metadata.
Then, we just use the new API in place of utils.list_relative_paths() for reading the artifact manifest.
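A rough sketch of how such an API could behave is shown here. Everything in it is an assumption made for illustration: the `get_manifest()` name, the `ArtifactNotCachedError` exception, the `meta/manifest.yaml` location, and the simplified stand-in `Element` class are not taken from the real code base. The parser is passed in as a callable so the sketch does not depend on any particular YAML loader.

```python
import os


class ArtifactNotCachedError(Exception):
    """Hypothetical error raised when the artifact is not in the local cache."""


class Element:
    """Heavily simplified stand-in for the real Element class."""

    def __init__(self, artifact_dir, load):
        # artifact_dir: where the cached artifact is extracted, or None (assumed layout)
        # load: callable taking an open file and returning the parsed manifest
        self._artifact_dir = artifact_dir
        self._load = load

    def _cached(self):
        return self._artifact_dir is not None and os.path.isdir(self._artifact_dir)

    def get_manifest(self):
        """Return the manifest recorded at build time, without walking the artifact."""
        if not self._cached():
            raise ArtifactNotCachedError("artifact is not in the local cache")
        # Assumed location of the manifest inside the artifact metadata
        manifest_path = os.path.join(self._artifact_dir, "meta", "manifest.yaml")
        with open(manifest_path, encoding="utf-8") as f:
            return self._load(f)
```

Staging code would then iterate over the result of `get_manifest()` instead of calling utils.list_relative_paths() on the extracted directory.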
I've started to look at implementing this in order to speed up staging.
I'm finding that for my large test case with roughly 100,000 files it spends quite a lot of time loading the manifest.yaml, which is roughly 1MB in size. This makes it even slower than before!
I find that the default 'round-trip loader' takes significantly longer than the 'safe loader':
$ time python3 -c 'from ruamel.yaml import YAML; yaml = YAML(); f = open("/path/to/manifest.yaml"); d = yaml.load(f);'
real 0m19.793s
user 0m19.700s
sys 0m0.110s
$ time python3 -c 'from ruamel.yaml import YAML; yaml = YAML(typ="safe"); f = open("/path/to/manifest.yaml"); d = yaml.load(f);'
real 0m0.787s
user 0m0.740s
sys 0m0.040s
Will it be acceptable to use something other than the round-trip loader for manifest.yaml? It doesn't feel like it would benefit from round-tripping.
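The `time python3 -c` measurements include interpreter start-up and import time. A small in-process harness like the one here can isolate the load call itself. This is only a sketch: `time_loader` is a hypothetical helper, and it accepts any callable that takes an open file.

```python
import time


def time_loader(load, path, repeat=3):
    """Return the best wall-clock time, in seconds, of load(file) over `repeat` runs."""
    best = float("inf")
    for _ in range(repeat):
        with open(path, encoding="utf-8") as f:
            start = time.perf_counter()
            load(f)
            elapsed = time.perf_counter() - start
        best = min(best, elapsed)
    return best


# Example usage, assuming ruamel.yaml is installed:
#   from ruamel.yaml import YAML
#   print(time_loader(YAML().load, "/path/to/manifest.yaml"))
#   print(time_loader(YAML(typ="safe").load, "/path/to/manifest.yaml"))
```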
After more disappointing performance, I discovered that how you construct your SafeLoader makes a big difference:
$ time python3 -c 'from ruamel.yaml import YAML; yaml = YAML(typ="safe"); f = open("/path/to/manifest.yaml"); d = yaml.load(f);'
real 0m0.787s
user 0m0.740s
sys 0m0.040s
$ time python3 -c 'from ruamel import yaml; f = open("/path/to/manifest.yaml"); d = yaml.load(f, yaml.loader.SafeLoader);'
real 0m12.256s
user 0m12.180s
sys 0m0.060s
The docs showing the 'YAML' usage also say this:
This is the new (0.15+) interface for ruamel.yaml, it is still in the
process of being fleshed out. Please pin your dependency to ruamel.yaml<0.15
for production software.
I can see that with everything else being yaml, you may not be keen on that inconsistency :)
Yeah really not; I suppose however, so long as we pin ruamel to a given specific version, it might be acceptable to pin it to a specific version that is >= 0.15 for the sake of using that API in places where it counts - this gives us a future path forward without breaking artifact format.
I think that for the huge lists you are talking about, 0m0.787s vs 0m0.050s is a negligible difference.
Also, I considered just a flat file with a list of paths, but that would be quite undesirable if and when we want to augment that data with attribute information (permissions, xattrs, etc).
You might want to compare the differences in load times:
Not good, it's a list that is not very forward compat:
I had a quick go at adapting list_relative_paths() to use Python 3.5's os.scandir(), it took roughly 0.5 seconds to run instead. We're supporting minimum Python 3.4 though, so not much help.
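For reference, a directory walk built on os.scandir() might look roughly like the sketch here. It is not the adapted code from the experiment above, and `list_relative_paths_scandir` is a hypothetical name. It yields entries in sorted order with each directory before its contents, which may not match the exact ordering of the existing utils.list_relative_paths().

```python
import os


def list_relative_paths_scandir(basedir):
    """Yield every path under basedir, relative to basedir (requires Python 3.5+)."""

    def walk(dirpath, relpath):
        # sorted() consumes the scandir iterator fully, which also closes it
        for entry in sorted(os.scandir(dirpath), key=lambda e: e.name):
            rel = os.path.join(relpath, entry.name) if relpath else entry.name
            yield rel
            # Do not descend into symlinked directories; report them as entries only
            if entry.is_dir(follow_symlinks=False):
                yield from walk(entry.path, rel)

    yield from walk(basedir, "")
```

The speed-up comes from os.scandir() returning file type information along with each name, so most entries need no extra stat call.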
Following on from your message on the mailing list, does this mean you would accept a patch which read and wrote manifest.yaml using Python's json module? The idea being that we'd just be using it as an optimised yaml parser behind the scenes.
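A minimal sketch of that idea, assuming a hypothetical manifest schema (a `files` list of mappings with `path` and `mode` keys, chosen only for illustration). JSON documents are, apart from a few edge cases, also valid YAML 1.2, so a file written this way should remain readable by a YAML loader later on.

```python
import json


def write_manifest(path, files):
    """Write the manifest as JSON; `files` is a list of dicts (hypothetical schema)."""
    with open(path, "w", encoding="utf-8") as f:
        # One top-level mapping leaves room for further keys in future
        json.dump({"files": files}, f, indent=2, sort_keys=True)


def read_manifest(path):
    """Read a manifest written by write_manifest() and return the list of file entries."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)["files"]
```

Using a mapping per file rather than a bare path string keeps room for attributes such as permissions or xattrs.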
Ok so, while it does mean we can live with a less desirable metadata format until such time as we can change to nicely formatted yaml, do we have to jump on that opportunity right away?
Is this immensely pressing right now?
Also, what about the denormalization approach of using flat lists, also mentioned in that thread? Is this an option before we resort to this?
It would be interesting to know whether the CAS-based cache and the virtual directory work done to support remote execution affect this at all, and if so, whether the effect is positive or negative.