When saving datasets, make sure the recorded digest really matches the dataset

Currently we save the digest only if it is already present in metadata. This is not correct: we should always save a correct digest, whether or not one is in metadata. I suggest extending the Dataset.save interface with a compute_digest: ComputeDigest = ComputeDigest.ALWAYS keyword argument. This mirrors Dataset.load, only with a different default value. The meanings of the values would be:

  • NEVER - keep the current behavior: store the digest only if it is available in metadata
  • ONLY_IF_MISSING - if there is no digest in metadata, compute it; otherwise store the existing one as-is
  • ALWAYS - always compute a fresh digest and store that, regardless of whether one is in metadata
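The proposed semantics could be sketched roughly like this. Note this is just an illustration of the decision logic, not the actual d3m implementation: the digest_to_save helper and the use of SHA-256 over raw bytes are assumptions for the example (d3m computes the digest over the dataset's files and metadata).

```python
import enum
import hashlib
import typing


class ComputeDigest(enum.Enum):
    NEVER = 1
    ONLY_IF_MISSING = 2
    ALWAYS = 3


def digest_to_save(
    metadata_digest: typing.Optional[str],
    data: bytes,
    compute_digest: ComputeDigest = ComputeDigest.ALWAYS,
) -> typing.Optional[str]:
    """Decide which digest to store when saving, per the proposed semantics.

    ``metadata_digest`` is whatever digest (if any) is currently in metadata;
    ``data`` stands in for the dataset content being saved.
    """
    if compute_digest is ComputeDigest.ALWAYS:
        # Always recompute from the data actually being saved.
        return hashlib.sha256(data).hexdigest()
    if compute_digest is ComputeDigest.ONLY_IF_MISSING and metadata_digest is None:
        # No digest in metadata, so compute one now.
        return hashlib.sha256(data).hexdigest()
    # NEVER, or ONLY_IF_MISSING with an existing digest:
    # store whatever metadata already has (possibly nothing).
    return metadata_digest
```

With the ALWAYS default, Dataset.save would then store digest_to_save(...) in the saved metadata instead of blindly copying the loaded value.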

The digest in metadata becomes stale very quickly, because after loading, the Dataset object can be modified. The digest is not updated on modification; it stays there so that the original source of the data can still be referenced. But when saving, we have to store a digest computed from the latest data and metadata in the Dataset object.
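The staleness problem can be seen in a few lines. This is a toy illustration (SHA-256 over raw bytes is an assumption; the real digest covers the dataset's files and metadata), showing why the digest recorded at load time cannot be trusted at save time:

```python
import hashlib


def compute_digest(data: bytes) -> str:
    """Toy stand-in for the dataset digest: hash of the raw content."""
    return hashlib.sha256(data).hexdigest()


# Content at load time, with its digest recorded in metadata.
original = b'col_a,col_b\n1,2\n'
digest_at_load = compute_digest(original)

# Simulate code modifying the dataset after loading (e.g. adding a row).
modified = original + b'3,4\n'

# The digest stored in metadata no longer matches the current data,
# so saving must recompute it rather than copy the stale value.
assert compute_digest(modified) != digest_at_load
```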

There is no need to integrate it anywhere, but you can also use the update-digest script to validate, while developing this, that things are saved correctly: after the fix, the script should not update any digest. What I did was load and save all d3m datasets using the CLI interface and then run the update-digest.py script, and currently all digests get updated. That is not OK. You can use the same procedure to verify that this no longer happens once this is fixed.