# Validator for Quantify dataset v2

## Description
We (mostly me so far) have chosen in !243 (closed) to use an [xarray accessor](http://xarray.pydata.org/en/latest/internals/extending-xarray.html). This poses some challenges for how validation is implemented.
I will leave here the history of discussion between me and @luis.miguens from #254 about the validator in order to keep things more manageable.
Note about implementation: when the `to_gridded_dataset` function is used, the dataset can be fully gridded or in a mixed format of gridded and non-gridded coordinates. This makes it difficult to define two separate classes to identify a "raw" Quantify dataset and a "processed" Quantify dataset. Therefore my idea is to have only one functional `QuantifyDataset` and a class `RawQuantifyDataset` that is simply a validator, in the same way the `Settable` and `Gettable` are, i.e.:
For validation purposes we would add an `is_raw_quantify_dataset` property that would check that none of the main/secondary variables are explicitly indexed by any main/secondary coordinates (I think this is what needs to be checked for, according to the spec of !224 (merged); in any case, something along these lines). The `to_gridded_dataset` function sets explicit indexes and unstacks the data.
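A minimal sketch of such a check, assuming "raw" means that no explicit (dimension) indexes have been set yet (the function name and the exact criterion are assumptions for illustration, not the actual spec of !224):

```python
import numpy as np
import xarray as xr

def is_raw_quantify_dataset(ds: xr.Dataset) -> bool:
    """Hypothetical check: a 'raw' dataset has no explicit indexes,
    i.e. no coordinate has been promoted to a dimension index
    (which gridding via set_index/unstack would create)."""
    return len(ds.indexes) == 0
```

After something like `ds.set_index(dim_0="x0")` the check would return `False`, which matches the idea that a gridded/processed dataset is no longer "raw".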
Hi @caenrigen,
What do you mean by "one functional `QuantifyDataset` and a class `RawQuantifyDataset` that is simply a validator in the same way the `Settable` and `Gettable`"?
The `Settable` and `Gettable` classes cannot be instantiated; they just check whether the passed-in object is valid and return back the same object, e.g. `my_settable_qcodes_param == Settable(my_settable_qcodes_param)`.

For the Quantify dataset we are going to have a slight issue: the (xarray) object might get modified a bit (e.g. by the `to_gridded_dataset` function), and that modified object is not going to be strictly in the same "format" as the raw datasets that are loaded from disk. This is a slight problem when we want to specify that the input to some function must be that "raw" Quantify dataset (one that was not modified at all). Does this make sense?
It is not very easy to explain; in any case this is a suggestion for now, and input from other people is welcome.
Not sure if you have already read the docs of !224 (merged); that might help in understanding the potential problem.
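For reference, the validator-class pattern being referred to can be sketched roughly like this (a simplified illustration, not the actual quantify-core `Settable` implementation; the required attributes are assumptions):

```python
class Settable:
    """Validator-style class: never really instantiated. `__new__`
    checks the passed-in object and returns that same object."""

    def __new__(cls, obj):
        # Required interface chosen for illustration only.
        for attr in ("name", "set"):
            if not hasattr(obj, attr):
                raise AttributeError(
                    f"Cannot use {obj!r} as a Settable: missing `{attr}`."
                )
        return obj  # the same object is returned, so `Settable(p) is p`
```

Because the same object comes back, `my_settable_qcodes_param == Settable(my_settable_qcodes_param)` holds trivially, and the call doubles as a validation step.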
I have a remark regarding the proposed implementation extending `xarray.Dataset`. According to the documentation of xarray, this way of extending is prone to problems.
The developers of xarray recommend extending their datasets using the decorators `register_dataset_accessor()` and `register_dataarray_accessor()`. More information is available at: http://xarray.pydata.org/en/stable/internals/extending-xarray.html
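For illustration, the accessor approach from the linked docs looks roughly like this (the namespace name `quantify` and the `tuid` check are assumptions, not an actual implementation):

```python
import xarray as xr

@xr.register_dataset_accessor("quantify")  # namespace name is an assumption
class QuantifyAccessor:
    """Adds a `ds.quantify` namespace without subclassing `xr.Dataset`."""

    def __init__(self, ds: xr.Dataset):
        self._ds = ds

    def validate(self) -> xr.Dataset:
        # Minimal illustrative check: require a `tuid` attribute.
        if "tuid" not in self._ds.attrs:
            raise ValueError("Not a Quantify dataset: missing `tuid` attr.")
        return self._ds
```

The accessor is registered once and then available on every `xr.Dataset` instance, which avoids the subclassing pitfalls the xarray docs warn about.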
I looked into libraries for data validation, and I found:
- xsimlab: This is part of a package that uses xarray for simulations; as a part of the package, we can find the validation functionality. Maybe overkill for our case.
- Pandera: Originally for data validation of pandas, now it is generic and supports validation of other types of data. The problem is that we need to write an engine to process xarray (although I understood that it is a work in progress, so maybe if we wait, we can use it). A very promising option.
  - https://github.com/pandera-dev/pandera
  - https://github.com/pandera-dev/pandera/issues/369
  - https://github.com/openclimatefix/nowcasting_dataset/issues/211
- Pydantic: A very generic data validation library. People have been using it for validation of xarrays (extending Dataset rather than using decorators).
  - https://pydantic-docs.helpmanual.io/
  - https://github.com/openclimatefix/nowcasting_dataset/blob/jack/pydantic/notebooks/pydantic_xarray.ipynb
The most elegant solution, for me, is implementing the engine for xarray in Pandera (or waiting until someone else does). The second option is implementing the validation with Pydantic.
If someone sees a use for xsimlab besides data validation, maybe we can use it, but exclusively for data validation it is overkill, in my opinion.
Does anybody have any other suggestions?
Hi Luis, great findings!
- xsimlab seems like an awesome project; I wish I had used it during my thesis simulations hehehe. @AdriaanRol please check it out, it might be very relevant for OQS and it integrates nicely with xarray.
- On the validators side, indeed it seems like there is not that much that we can make use of;
- However, their docs directly point to the attrs package, which might be useful, but I am not sure it can be applied easily to the validation of the xr `Dataset`/`DataArray`s and their attributes, since it seems to require setting up the validators in the code of the classes and methods.
- Sounds interesting and nice that they are making it more general
- There seems to be some recent interest in supporting xarray, namely n-dimensional array validation (here, here and here).
- Pandera also lists a bunch of other validation tools: https://github.com/pandera-dev/pandera#alternative-data-validation-libraries that we could look into.
- Pydantic might be still interesting.
> extending Dataset rather than using decorators
Can we use Pydantic in our custom xarray accessor class (instead of subclassing xarray objects)?
From your research and from skimming the links, `xsimlab` does not seem interesting for this and is overkill.

On the Pandera vs Pydantic side, could you elaborate a bit more on the reasons for your preference?
Here are my ideas for the requirements of a validation package/solution for the dataset:
- Well maintained package(s).
- "Support" for the xarray accessor, since that is the recommended way to extend xarray. I would rather stick to the recommended way.
- Validate attributes of the `xr.Dataset` and `xr.DataArray`, e.g. `type`, literal values, etc.
- Validate the underlying numpy objects, e.g. dtype, shape/dimension, values, etc.
- Custom (complex) validation. E.g. the `QDatasetIntraRelationship` of !224 (merged) requires validating that the `DataArray`s of the relationship actually exist in the dataset (typos in the names can happen easily).
- Allow for two types of validation: imperative (e.g. the `type` of the attrs must always be correct) and "on demand" (some checks should not be enforced except when needed, e.g. some analyses will require a "raw" Quantify dataset as input, therefore such a validation should be accessible to be executed "on demand").
- I think using more than one "solution"/package might be OK. Namely, the validation of the dataset attributes might be easier/enough with existing tools (!224 (merged) implements the attributes as dictionaries obtained from a dataclass, and I expect there to already be many tools for validating these dictionaries against the corresponding dataclass).
Some nice-to-haves:
Extendable by users, e.g., easily add "extra" validation in some specific analysis.
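That last validation point can even be sketched with the standard library alone; assuming the attrs spec is a dataclass (the field names below are made up for illustration), a plain attrs dict can be checked against it:

```python
from dataclasses import dataclass, fields

@dataclass
class QDatasetAttrs:
    # Hypothetical subset of the attrs spec, for illustration only.
    tuid: str
    grid_2d: bool

def validate_attrs(attrs: dict, schema=QDatasetAttrs) -> None:
    """Check that required keys are present and that values match the
    annotated (simple) types of the dataclass schema."""
    for field in fields(schema):
        if field.name not in attrs:
            raise KeyError(f"Missing required attribute `{field.name}`.")
        if not isinstance(attrs[field.name], field.type):
            raise TypeError(
                f"Attribute `{field.name}` must be {field.type.__name__}, "
                f"got {type(attrs[field.name]).__name__}."
            )
```

This only handles simple, non-generic type annotations; for nested or `Optional` types a dedicated tool would be needed, which is exactly where packages like Pydantic come in.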
To make all this information actionable, could you help with the following to start with?
> Can we use Pydantic in our custom xarray accessor class (instead of subclassing xarray objects)?
Regarding my preference for Pandera over Pydantic: it lies mainly in the simplicity of the schema for the validation. The problem is that Pandera does not have the engine to use xarray (yet).
If we remove Pandera from the equation, either because we don't want to wait or to write the engine to connect to xarray, the only remaining choice is Pydantic.
Pydantic is quite a stable package, but we must write more to have a validator working.
The documentation says that the way to use Pydantic via decorators is to register a validation function in an xarray namespace. Thus, we will need to call the validation function every time we assert that the dataset is valid.

I looked into the option of registering a hook in xarray. Still, it seems to be designed to add particular types to a dataset rather than to provide generic hooks for add/remove operations. I was looking for a pre-insert hook to validate the data before adding it to the dataset (and rejecting the addition if the data does not comply with the schema). Unfortunately, I could not find a way to do it like that.
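A rough sketch of the "register a validation function in an xarray namespace" approach combined with Pydantic might look like this (the schema fields and the namespace name are assumptions, not an agreed design):

```python
import xarray as xr
from pydantic import BaseModel

class DatasetAttrsSchema(BaseModel):
    # Hypothetical attrs schema, for illustration only.
    tuid: str
    name: str = ""

@xr.register_dataset_accessor("validation")  # namespace name is an assumption
class ValidationAccessor:
    def __init__(self, ds: xr.Dataset):
        self._ds = ds

    def validate(self) -> xr.Dataset:
        """Must be called explicitly ('on demand'); raises
        pydantic.ValidationError when the attrs do not match the schema."""
        DatasetAttrsSchema(**self._ds.attrs)
        return self._ds
```

As noted above, there is no pre-insert hook, so `ds.validation.validate()` has to be called explicitly wherever a valid dataset is required.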
Hi Luis! Thank you very much for the research on this.
> Pydantic is quite a stable package, but we must write more to have a validator working.
Since you already looked at both a bit, how do you think the amount of work required compares between: a) using pydantic vs b) writing the Pandera engine and using it?
Since for Pandera

> it is mainly in the simplicity of the schema for the validation

I would say it can be worth it long-term to invest in the engine for xarray.
> Thus, we will need to call the validation function every time we assert that the dataset is valid.
About this: it is probably OK, because in many cases we will not want to run the validation, since people (or code) will work with "incomplete" or "in construction" datasets, and something that always tries to validate the dataset might get in the way and become rather clumsy.
> I was looking for a pre-insert hook to validate the data before adding it to the dataset (and rejecting the addition if the data does not comply with the schema).
To make sure we do not miscommunicate, we will likely need validation for:
- dataset (e.g., how many DataArrays/coordinates/dimensions are in there)
- specific DataArrays (e.g., shape, dtype)
- Dataset.attrs and DataArray.attrs (e.g., present keys, type of the values, values of the types)
I am mentioning this because the last one (`attrs`) is a dictionary that can be modified in many ways, and it might be very difficult to have a mechanism ensuring that any change to that dictionary is validated. In short, do not worry much about "automatic validation on the fly", and keep in mind that we need a somewhat varied range of things to validate.
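To illustrate the three levels together, an "on demand" check might look like this (all names and rules are made-up placeholders, not the !224 spec):

```python
import numpy as np
import xarray as xr

def validate_dataset(ds: xr.Dataset) -> None:
    """Illustrative check covering the three levels listed above."""
    # 1. Dataset level: must contain data variables and coordinates.
    if not ds.data_vars or not ds.coords:
        raise ValueError("Dataset must contain data variables and coordinates.")
    # 2. DataArray level: e.g. a numeric dtype for every data variable.
    for name, da in ds.data_vars.items():
        if not np.issubdtype(da.dtype, np.number):
            raise TypeError(f"Variable `{name}` must be numeric, got {da.dtype}.")
    # 3. attrs level: required keys with correctly typed values.
    if not isinstance(ds.attrs.get("tuid"), str):
        raise ValueError("Dataset attrs must contain a string `tuid`.")
```

Each of the three blocks could later be delegated to a dedicated tool (e.g. Pandera for the array-level checks, Pydantic or a dataclass-based check for the attrs).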
## Motivation
Follow up from #187 (closed), #254 and !224 (merged) (and !243 (closed)).
Several modules/functions will require a valid Quantify dataset as their interface; a validator is needed to enforce this and to catch problems as soon as possible.