Store artifact manifests in metadata

I've started to look at implementing this in order to speed up staging.

I'm finding that for my large test case, with roughly 100,000 files, staging spends quite a lot of time loading manifest.yaml, which is roughly 1MB in size. This makes staging even slower than before!

I find that the default 'round-trip loader' takes significantly longer than the 'safe loader':

$ time python3 -c 'from ruamel.yaml import YAML; yaml = YAML(); f = open("/path/to/manifest.yaml"); d = yaml.load(f);'

real    0m19.793s
user    0m19.700s
sys     0m0.110s

$ time python3 -c 'from ruamel.yaml import YAML; yaml = YAML(typ="safe"); f = open("/path/to/manifest.yaml"); d = yaml.load(f);'

real    0m0.787s
user    0m0.740s
sys     0m0.040s

Will it be acceptable to use something other than the round-trip loader for manifest.yaml? It doesn't feel like it would benefit from round-tripping.

> Will it be acceptable to use something other than the round-trip loader for manifest.yaml? It doesn't feel like it would benefit from round-tripping.

Sure!

My only nitpick: please make it something nicely baked into the private _yaml API.

After more disappointing performance numbers, I discovered that how you construct the safe loader makes a big difference:

$ time python3 -c 'from ruamel.yaml import YAML; yaml = YAML(typ="safe"); f = open("/path/to/manifest.yaml"); d = yaml.load(f);'

real    0m0.787s
user    0m0.740s
sys     0m0.040s

$ time python3 -c 'from ruamel import yaml; f = open("/path/to/manifest.yaml"); d = yaml.load(f, yaml.loader.SafeLoader);'

real    0m12.256s
user    0m12.180s
sys     0m0.060s

The docs showing the 'YAML' usage also say this:

This is the new (0.15+) interface for ruamel.yaml, it is still in the process of being fleshed out. Please pin your dependency to ruamel.yaml<0.15 for production software.

http://yaml.readthedocs.io/en/latest/basicuse.html

We don't seem to require a particular version of ruamel just now: https://gitlab.com/BuildStream/buildstream/blob/master/setup.py#L169

I'm guessing that given this, we'd prefer not to use the 'YAML' interface until the advice changes.

If that's the case, would the Python standard library's json module be acceptable? It performs very well:

$ time python3 -c 'import json; f = open("/path/to/manifest.json"); d = json.load(f);'

real    0m0.050s
user    0m0.030s
sys     0m0.010s

I can see that with everything else being yaml, you may not be keen on that inconsistency :)

> I can see that with everything else being yaml, you may not be keen on that inconsistency :)

Yeah, really not. However, so long as we pin ruamel to a specific version, I suppose it might be acceptable for that version to be >= 0.15, for the sake of using that API in places where it counts. This gives us a path forward without breaking the artifact format.

I think that for the huge lists you are talking about 0m0.787s vs 0m0.050s is a negligible difference.

Also, I considered just a flat file with a list of paths, but that would be quite undesirable if and when we want to augment that data with attribute information (permissions, xattrs, etc).

You might want to compare the differences in load times:

Not good; a plain list is not very forward compatible:

files:
- filename1
- filename2

Better forward compatibility:

files:
- filename: filename1
- filename: filename2
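
A minimal sketch of that comparison, using only the standard library. It builds both manifest shapes for a synthetic file list, serializes them, and times the load. The json module stands in for the YAML loader purely so the snippet runs without ruamel.yaml, so the absolute numbers will not match the timings in this thread. The names make_manifests and time_load are hypothetical helpers, and the 'mode' attribute is a made-up example of a future per-file field.

```python
import json
import timeit


def make_manifests(count):
    """Build the two candidate manifest shapes for `count` synthetic paths."""
    names = ["usr/share/file{}".format(i) for i in range(count)]
    flat = {"files": names}
    per_file = {"files": [{"filename": name} for name in names]}
    return flat, per_file


def time_load(document, loader=json.loads, repeat=3):
    """Return the best wall-clock time (seconds) to parse `document`."""
    return min(timeit.repeat(lambda: loader(document), number=1, repeat=repeat))


flat, per_file = make_manifests(100000)

# The dict-per-file shape can grow new keys without changing its structure
# ('mode' is a hypothetical attribute, used here only for illustration).
per_file["files"][0]["mode"] = 0o644

for label, manifest in (("flat list", flat), ("dict per file", per_file)):
    text = json.dumps(manifest)
    print("{}: {} bytes, {:.3f}s".format(label, len(text), time_load(text)))
```

Passing a different callable as `loader` (for example a YAML loader's load method) would let the same harness compare parsers as well as shapes.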

Unfortunately with a dict per file, the costs seem to be significant again.

Loading the manifest is slower than walking the directory with utils.list_relative_paths():

$ time python3 -c 'from ruamel.yaml import YAML; yaml = YAML(typ="safe"); f = open("/path/to/manifest.yaml"); d = yaml.load(f);'
real    0m3.053s
user    0m2.940s
sys     0m0.090s

$ python3 -c 'import timeit; path = "/path/to/files"; print(timeit.timeit("list(buildstream.utils.list_relative_paths(" + repr(path) + "))", number=1, setup="import buildstream"))'
2.478195433039218

$ time find /path/to/files > /dev/null
real    0m0.253s
user    0m0.000s
sys     0m0.230s

I had a quick go at adapting list_relative_paths() to use Python 3.5's os.scandir(), and it took roughly 0.5 seconds to run instead. Our minimum supported version is Python 3.4 though, so that's not much help.
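
For reference, a rough sketch of what an os.scandir() based walk might look like. This is not the actual experiment and not BuildStream's implementation of list_relative_paths(); scan_relative_paths is a hypothetical name, and the sketch assumes Python 3.5 or later and that symlinks to directories should be listed but not followed.

```python
import os


def scan_relative_paths(root, prefix=""):
    """Yield paths of everything under `root`, relative to `root`.

    Directories are yielded before their contents; symlinks to
    directories are yielded but not descended into.
    """
    # Sort the entries so the output order is deterministic.
    for entry in sorted(os.scandir(root), key=lambda e: e.name):
        relative = prefix + entry.name
        yield relative
        # DirEntry usually knows the file type already, which avoids an
        # extra stat() call per entry on most platforms.
        if entry.is_dir(follow_symlinks=False):
            yield from scan_relative_paths(entry.path, relative + os.sep)
```

The relative path is built up from the prefix as the recursion descends, so no per-file path normalization is needed.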

Ooh, just avoiding calls to os.path.relpath() makes it much faster. I'll send an MR for that.
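
To illustrate why that helps, here is a sketch under assumptions. os.path.relpath() normalizes both of its arguments on every call, whereas every directory yielded by os.walk() already starts with the root that was passed in, so the relative part can simply be sliced off. The function names below are hypothetical, and this is neither the BuildStream code nor the eventual MR.

```python
import os


def relative_paths_with_relpath(root):
    """Baseline: call os.path.relpath() once per file."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            yield os.path.relpath(os.path.join(dirpath, name), root)


def relative_paths_by_slicing(root):
    """Same paths, but strip the root prefix by slicing instead."""
    # Number of leading characters to drop: the root plus one separator,
    # unless the root already ends with a separator.
    skip = len(root) if root.endswith(os.sep) else len(root) + 1
    for dirpath, _dirnames, filenames in os.walk(root):
        reldir = dirpath[skip:]  # empty string for the root itself
        for name in filenames:
            yield os.path.join(reldir, name)
```

Both generators should yield the same paths in the same order for a given root; only the cost per file differs.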

I've made an attempt at persuading you over to JSON for artifact metadata on the mailing list :) https://mail.gnome.org/archives/buildstream-list/2017-October/msg00029.html

Following on from your message on the mailing list, does this mean you would accept a patch which reads and writes manifest.yaml using Python's json module? The idea being that we'd just be using it as an optimised yaml parser behind the scenes.
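
A minimal sketch of that idea, assuming the manifest only holds plain strings, numbers, lists and dicts. JSON output is, for practical purposes, valid YAML 1.2 flow syntax, so a file written by json.dump() should remain readable by a YAML loader, while json.load() reads it back quickly. The names write_manifest and read_manifest are hypothetical, not existing BuildStream functions.

```python
import json
import os
import tempfile


def write_manifest(path, files):
    """Write the manifest as JSON, which YAML loaders can also parse."""
    manifest = {"files": [{"filename": name} for name in files]}
    with open(path, "w", encoding="utf-8") as f:
        # indent=1 keeps the file diffable; sort_keys keeps output stable.
        json.dump(manifest, f, indent=1, sort_keys=True)


def read_manifest(path):
    """Read a manifest previously written by write_manifest()."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


with tempfile.TemporaryDirectory() as tmpdir:
    manifest_path = os.path.join(tmpdir, "manifest.yaml")
    write_manifest(manifest_path, ["usr/bin/foo", "usr/lib/libfoo.so"])
    loaded = read_manifest(manifest_path)
    print(loaded["files"][0]["filename"])
```

The reverse does not hold: if the file is ever rewritten as block-style YAML, json.load() will fail, so the reader would need a YAML fallback at that point.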

OK, so while it does mean we can live with a less desirable metadata format until such time as we can change to nicely formatted YAML, do we have to jump on that opportunity right away?

Is this immensely pressing right now?

Also, what about the denormalization approach of using flat lists, also mentioned in that thread? Is this an option before we resort to this?

> Is this immensely pressing right now?

Nah, definitely more interested in the overall benchmarking.

I thought we could cross this one off easily, and with satisfaction, after the last message; maybe I just read it the way I wanted to :)

Will leave it for now and circle back when it seems the most important thing. Thanks!

mentioned in issue #174 (closed)

It would be interesting to know whether the CAS-based cache and the virtual directory work done to support remote execution affect this at all and, if so, whether the effect is positive or negative.

changed milestone to %BuildStream_v1.4

This is quite old. We now have bst artifact list-contents, which achieves the purpose without requiring manifests, so let's close this.

closed
