Update logic around problem, pipeline, and dataset IDs

Fixes #190 (closed) and #154 (closed).

NOTE: This description is obsolete.

  • Datasets can be uniquely represented by their ID and digest. This MR adds more logic about how and when to generate digests during dataset loading. Computing a digest can take time (even in the range of 10 seconds on large datasets), so by default digests are now computed only if they are missing. I also made https://gitlab.datadrivendiscovery.org/d3m/datasets/merge_requests/14 to add digests to datasets so that loading of seed datasets should be quick.
    • Ideally, we should always compute the digest to make sure pipeline runs are recorded against the correct version of a dataset, but this means every creation of a Dataset object could take time.
    • We could introduce a per-process cache of those digests: compute the digest the first time a particular dataset is loaded, compare it with the value in metadata, and then reuse the computed digest inside the same process. Ideally, datasets will not change while one process is running.
  • Added CLI arguments to the runtime which allow you to control when dataset digests are computed.
  • Problem descriptions and pipeline descriptions now have deterministically generated IDs based on their content.
  • Whenever a pipeline is stored, the correct ID is used, but when a pipeline is loaded, only a warning is issued if the ID in the document does not match the computed one.
  • Checking a pipeline for validity throws an exception if the ID does not match.
  • I had to implement a "canonical pipeline description" over which we compute the ID, because blindly computing the ID over the pipeline document does not work when the document contains nested sub-pipeline documents (instead of just referencing them). I think this is OK because in the metalearning database we should not be storing nested pipeline documents anyway, and if somebody wants to, a blind ID computation/check will reject them. Nested pipeline documents are a special case anyway.
  • When computing a primitive digest, the primitive's ID is included in the hash so that the digest is not the same for all primitives from the same package (see #154 (closed) for more information).
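The per-process digest cache idea above could be sketched roughly as follows. This is illustrative only, not the actual d3m implementation; the function names, the module-level dict cache, and the warning behavior are assumptions.

```python
import hashlib

# Hypothetical per-process cache: dataset URI -> computed digest.
_digest_cache = {}

def compute_digest(data: bytes) -> str:
    """Compute a SHA-256 digest over serialized dataset content."""
    return hashlib.sha256(data).hexdigest()

def get_digest(dataset_uri: str, data: bytes, recorded_digest: str = None) -> str:
    """Return the digest for a dataset, computing it only on first load.

    On the first load in this process the digest is computed and compared
    against the value recorded in metadata (warning on mismatch); later
    loads in the same process reuse the cached value.
    """
    if dataset_uri not in _digest_cache:
        digest = compute_digest(data)
        if recorded_digest is not None and digest != recorded_digest:
            print(f"Digest mismatch for {dataset_uri}")  # warn, do not fail
        _digest_cache[dataset_uri] = digest
    return _digest_cache[dataset_uri]
```

Note that the cache assumes datasets do not change during the lifetime of the process, which matches the expectation stated above.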
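The deterministic content-based ID and the "canonical pipeline description" could be sketched like this. This is a sketch under assumptions, not the actual d3m code: the field names (`id`, `steps`, `pipeline`), the JSON canonicalization, and the SHA-256-to-UUID derivation are all illustrative.

```python
import copy
import hashlib
import json
import uuid

def canonical_description(description: dict) -> dict:
    """Canonicalize a description before computing its ID.

    The document's own ID is dropped (it must not influence its own
    computation) and any nested sub-pipeline documents are replaced by
    references to them, so that a nested and a referencing document
    canonicalize to the same form.
    """
    canonical = copy.deepcopy(description)
    canonical.pop('id', None)
    for step in canonical.get('steps', []):
        sub = step.get('pipeline')
        if isinstance(sub, dict) and 'steps' in sub:
            step['pipeline'] = {'id': sub['id']}
    return canonical

def compute_id(description: dict) -> str:
    """Derive a deterministic UUID from the canonical description."""
    serialized = json.dumps(canonical_description(description), sort_keys=True)
    digest = hashlib.sha256(serialized.encode('utf8')).digest()
    return str(uuid.UUID(bytes=digest[:16], version=4))
```

With this, storing always writes the computed ID, loading can merely warn on mismatch, and validity checking can reject a mismatch outright, as described above.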
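The #154 fix of mixing the primitive's ID into its digest could be sketched as below; `package_digest` stands in for whatever hash of the installed package the real code uses, and the names are assumptions for illustration.

```python
import hashlib

def primitive_digest(primitive_id: str, package_digest: str) -> str:
    """Digest a primitive, including its ID in the hash.

    Because the primitive's ID contributes to the hash, two primitives
    installed from the same package no longer share a digest.
    """
    h = hashlib.sha256()
    h.update(primitive_id.encode('utf8'))
    h.update(package_digest.encode('utf8'))
    return h.hexdigest()
```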

Edited by Mitar
