Quantify dataset design proposal

This issue is intended to address a specific task of #187 (closed): converge on the design of the "new" quantify dataset.

Here is a notebook similar to the one I shared in #187 (closed), now with the design below applied, though I had no time to add many more examples.

quantify-core_233.ipynb

Reminder

As before, in the quantify dataset the xarray variables named x{i} are set as xarray coordinates (they appear under the "Coordinates" dropdown in the dataset representation in JupyterLab), but there are no associated xarray dimensions of the same name (those appear under the "Dimensions" dropdown). In other words, the xarray coordinates x0, x1, etc. are not used to index the y0, y1, etc. xarray variables.

To achieve that indexing (which only makes sense for measurements that were done on a grid), we used quantify_core.data.handling.to_gridded_dataset(dataset, dimension="dim_0", coordinates=["x0", "x1", etc.]). After this conversion you will notice that x0, x1, etc. are promoted to coordinates indexing a dimension of the same name, and they show up in bold font.
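A minimal sketch of this idea in pure xarray (the set_index/unstack recipe below only illustrates the concept; it is not necessarily how to_gridded_dataset is implemented):

```python
import numpy as np
import xarray as xr

# A 3x3 grid measured point by point and stored flat along "dim_0":
x0_vals = np.repeat(np.linspace(0, 1, 3), 3)
x1_vals = np.tile(np.linspace(-1, 1, 3), 3)
y0_vals = x0_vals * x1_vals

dataset = xr.Dataset(
    data_vars={"y0": ("dim_0", y0_vals)},
    coords={"x0": ("dim_0", x0_vals), "x1": ("dim_0", x1_vals)},
)
assert set(dataset.dims) == {"dim_0"}  # x0/x1 are coordinates but not dimensions

# to_gridded_dataset achieves something similar to this pure-xarray recipe:
gridded = dataset.set_index(dim_0=["x0", "x1"]).unstack("dim_0")
assert set(gridded.dims) == {"x0", "x1"}  # x0/x1 now index their own dimensions
```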

Quantify dataset proposal

Besides the above, below follows a more detailed description of the proposed definition of the quantify dataset.

xarray dimensions

  • [Required] replication

    • Intuition for this xarray dimension: a naive way of storing repeated measurements would be to have dataset_repetition_0.hdf5, dataset_repetition_1.hdf5, etc., where each dataset was obtained from repeating exactly the same experiment.

    • Naming rationale:

      • Replaces repetition, with a more precise name which is indicative of its intended usage. [From dictionary: "replication = the repetition of a scientific experiment or trial to obtain a consistent result".]
      • Does not clash with schedule.repetitions or any kind of hardware repetitions.
    • Default behavior of plotting tools will be to average the dataset along this dimension (see the sketch at the end of this section).

    • [Optional] replication can be indexed by an optional xarray coordinate variable.

      • The variable must be named replication.
  • [Required] no other outer xarray dimensions allowed.

    • replication will be the outermost xarray dimension allowed in the dataset!
    • Rationale:
      • The plotting and analysis toolboxes need to rely on some assumptions about the dataset. Arguably, the "top level" of the dataset is critical for this, despite xarray's flexibility in selecting data.
  • [Required] acq_set_0

    • Note: intended as a rename of dim_0.

    • Naming rationale:

      • Have a more meaningful name for this xarray dimension in the context of measurements.
      • Getting ready for non-standard experiment flows (see next point).
      • "acquisition set" gives some intuition that things are acquired together/parallel.
  • [Optional] acq_set_{i}, where i > 0 is an integer.

    • Reserves the possibility to store data for experiments that we have not yet encountered ourselves. I have a gut feeling that we need this, but I might not have a good realistic example; some help here is welcome.

      • (Example?) Imagine measuring some qubits until all of them are in a desired state, returning the data of these measurements, and then proceeding to the "real" experiment you are interested in. I think having these extra independent xarray dimensions would allow storing the data from both stages in the same dataset.
    • [Required] all acq_set_{i} dimensions (including acq_set_0) are mutually exclusive. This means variables in the dataset cannot depend on more than one of these dimensions.

      • Bad variable: y0(replication, acq_set_0, acq_set_1), this should never happen in the dataset.
      • Good variable: y0(replication, acq_set_0) or y1(replication, acq_set_1).
  • [Optional, Advanced] other (nested) xarray dimensions under each acq_set_{i}

    • Intuition: intended primarily for time series, aka "time trace".

    • Other, potentially arbitrarily nested, xarray dimensions under each acq_set_{i} are allowed. What this means is that each entry in, e.g., a y3 xarray variable can be a 1D or nD array where each of those array dimensions has a corresponding xarray dimension.

    • Such xarray dimensions can be named arbitrarily.

    • Each such xarray dimension can be indexed by an xarray coordinate variable. E.g., for a time trace we would have in the dataset:

      • assert "time" in dataset.coords
      • assert "time" in dataset.dims
      • assert len(dataset.time) == len(dataset.y3.isel(replication=0, acq_set_0=0)) where y3 is a measured variable storing traces.
    • Note: (just a reiteration of what an int/float/complex numpy ndarray is) when nesting data like this, it is required to have "hyper-cubic"-shaped data, meaning that, e.g., dataset.y3.isel(replication=0, acq_set_0=0) == [[2], [5, 6]] is not possible, but dataset.y3.isel(replication=0, acq_set_0=0) == [[2, 3], [5, 6]] is.
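To make the rules above concrete, below is a minimal sketch of a compliant dataset (shapes and values are made up; y0_trace plays the role of the y3 trace variable in the assertions above):

```python
import numpy as np
import xarray as xr

rng = np.random.default_rng()
reps, points, trace_len = 5, 10, 100

dataset = xr.Dataset(
    data_vars={
        # One demodulated value per (repetition, acquisition):
        "y0": (("replication", "acq_set_0"), rng.random((reps, points))),
        # The raw time trace behind each demodulated value (nested "time" dimension):
        "y0_trace": (
            ("replication", "acq_set_0", "time"),
            rng.random((reps, points, trace_len)),
        ),
    },
    coords={
        # "Experiment coordinate" lying along acq_set_0 (not a dimension itself):
        "x0": ("acq_set_0", np.linspace(0, 1, points)),
        # Coordinate indexing the nested "time" dimension:
        "time": ("time", np.arange(trace_len) * 1e-9),
    },
)

assert "time" in dataset.coords and "time" in dataset.dims
assert len(dataset.time) == len(dataset.y0_trace.isel(replication=0, acq_set_0=0))

# Default behavior of the plotting tools: average along "replication".
averaged = dataset.mean(dim="replication")
```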

xarray coordinates variables

Only the following xarray coordinates are allowed in the dataset:

  • [Optional] variables that are "physical coordinates", usually equivalent to settables; usually a parameter that an experimentalist "sweeps" in order to observe the effect on some other physical property of a system.

    • Should be referred to as "physical coordinates" or "experiment coordinates". These are the coordinates that index the "experiment variables". This indexing can be made explicit in xarray with quantify_core.data.handling.to_gridded_dataset() (when this makes sense).
    • Naming: x{i} where i >= 0 is an integer.
    • Note on [Optional]: It should not be required to have at least one x0 in order to perform the measurement. For some experiments it might not be suitable to think of something being swept.
    • [Required] each x{i} must "lie" along one (and only one) acq_set_{i} xarray dimension (a sketch of such a check follows this list).
  • [Optional] coordinates used to index dimensions.

    • Naming: replication, acq_set_{i}, or an arbitrary name matching one of the nested dimensions mentioned above.
    • [Required] must "lie" along that (and only that) corresponding xarray dimension.
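A minimal sketch of what enforcing the x{i} rule could look like (validate_experiment_coords is a hypothetical helper, not an existing quantify-core function):

```python
import re

import xarray as xr


def validate_experiment_coords(dataset: xr.Dataset) -> None:
    """Check that each x{i} lies along one (and only one) acq_set_{i} dimension."""
    for name, coord in dataset.coords.items():
        if re.fullmatch(r"x\d+", str(name)):
            assert len(coord.dims) == 1 and str(coord.dims[0]).startswith("acq_set_"), (
                f"{name} must lie along one (and only one) acq_set_{{i}} dimension, "
                f"got {coord.dims}"
            )
```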

xarray data variables

These xarray data variables are "experiment variables" usually indexed by the "experiment coordinates" mentioned above. Each entry in one of these "experiment variables" is a data-point in the broad sense, i.e. it can be int/float/complex OR a nested numpy.ndarray (of one of these dtypes).

All the xarray data variables in the dataset (that are not xarray coordinates) must comply with the following:

  • Naming:
    • y{i} where i >= 0 is an integer; OR
    • y{i}_<arbitrary> where i >= 0 is an integer such that it matches an existing y{i} in the same dataset.
      • This is intended to denote a meaningful connection between y{i} and y{i}_<arbitrary>.
      • E.g., The digitized time traces stored in y0_trace(replication, acq_set_0, time) and the demodulated values y0(replication, acq_set_0) represent the same measurement in a different way.
    • Rationale: facilitates inspecting and processing the dataset in an intuitive way.
  • [Required] "lie" along at least the replication and acq_set_{i} dimensions.
    • Good example: y0(replication, acq_set_0)
    • Bad example: y1(replication) or y2(acq_set_0)
  • [Optional] "lie" along additional nested xarray dimensions.

At least for now, in the interest of time and scope reduction, the dataset should not contain any other xarray variables that do not comply with these requirements.
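A sketch of what enforcing these naming and dimension rules could look like (validate_data_vars is hypothetical, not an existing quantify-core function):

```python
import re

import xarray as xr


def validate_data_vars(dataset: xr.Dataset) -> None:
    """Check the y{i}/y{i}_<arbitrary> naming and the required dimensions."""
    base_names = {str(n) for n in dataset.data_vars if re.fullmatch(r"y\d+", str(n))}
    for name, var in dataset.data_vars.items():
        match = re.fullmatch(r"y(\d+)(_\w+)?", str(name))
        assert match, f"{name} does not follow the y{{i}}/y{{i}}_<arbitrary> naming"
        if match.group(2):  # y{i}_<arbitrary> needs a matching y{i}
            assert f"y{match.group(1)}" in base_names, (
                f"{name} has no matching y{match.group(1)} in the dataset"
            )
        # Must lie along (at least) replication and one acq_set_{i} dimension:
        assert "replication" in var.dims, f"{name} must lie along replication"
        assert any(str(d).startswith("acq_set_") for d in var.dims), (
            f"{name} must lie along an acq_set_{{i}} dimension"
        )
```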

Calibration points: a caveat to all of the above

I believe that the most elegant way of dealing with calibration points is as follows:

  • An xarray dimension named acq_set_{i}_calib.
  • xarray data variables named y{j}_calib, which must "lie" along acq_set_{i}_calib, i.e. y{j}_calib(replication, acq_set_{i}_calib, ...). Note that we would also have the corresponding y{j}(replication, acq_set_{i}, ...).
  • NB y{i}_<arbitrary>_calib is also valid.
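A minimal sketch of a dataset with such calibration variables (shapes, values, and the two-calibration-points interpretation are made up for illustration):

```python
import numpy as np
import xarray as xr

rng = np.random.default_rng()
reps, points, num_calib = 5, 10, 2

dataset = xr.Dataset(
    data_vars={
        "y0": (("replication", "acq_set_0"), rng.random((reps, points))),
        # Calibration points live along their own dimension, e.g. two points
        # per repetition (say, the qubit prepared in |0> and in |1>):
        "y0_calib": (
            ("replication", "acq_set_0_calib"),
            rng.random((reps, num_calib)),
        ),
    },
)
```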

Some pros:

  • minimize manual indexing and clumsy data processing,
  • intuitive dataset,
  • the naming convention would allow the plotting and analysis utilities to recognize the presence of such information from the variable name.

Some cons:

  • When processing data, some code duplication might arise, or at least a for loop will be required if it is desired to apply the same data processing to both the data and the calibration points (see the sketch below). I would say this is a reasonable price to pay.
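Continuing the sketch above, the for loop mentioned in this con could look like the following (process is a made-up placeholder for the actual data processing):

```python
def process(var):
    # Placeholder for whatever processing is applied (here: averaging).
    return var.mean(dim="replication")


# Same processing applied to both the data and its calibration points:
for name in ("y0", "y0_calib"):
    dataset[name] = process(dataset[name])
```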

As the title suggests, this is a bit of an exception to all that has been preached so far, but I trust you are able to understand what kind of changes are required in the overall specification of the quantify dataset to account for the "special" case of "calibration variables". PS: let's call them that: "calibration" variables.

Dataset attributes

  • [Required] "quantify_dataset_version" : "v1"

    • Just in case we break it a few times, at least the code will be able to do sanity checks (a sketch follows below).
  • ... to be written ...
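A sketch of such a sanity check (check_dataset_version is a hypothetical helper; only the attribute name and value come from this proposal):

```python
SUPPORTED_VERSIONS = {"v1"}


def check_dataset_version(dataset) -> None:
    """Raise if the dataset version is not supported by this code."""
    version = dataset.attrs.get("quantify_dataset_version")
    if version not in SUPPORTED_VERSIONS:
        raise NotImplementedError(
            f"quantify_dataset_version={version!r} is not supported."
        )
```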

Variables attributes

  • ... to be written ...

Final note

Dataset and variable attributes will be similar to the current ones. I had no time to revisit those; the rest is more important at the moment.

@AdriaanRol here we go, we can now nerd and nitpick about this stuff.
