Quantify dataset design proposal
This issue is intended to address a specific task of #187 (closed): converge on the design of the "new" quantify dataset.
A notebook similar to the one I shared in #187 (closed) is available, now with the design below applied, though I had no time to add many more examples.
Reminder
As before, in the quantify dataset the xarray variables named `x{i}` are set as xarray coordinates (they appear under the "Coordinates" dropdown in the dataset representation in Jupyter Lab), BUT there are no associated xarray dimensions (under the "Dimensions" dropdown) with the same name. In other words, the xarray coordinates `x0`, `x1`, etc. are not used to index the `y0`, `y1`, etc. xarray variables.
To achieve that (which only makes sense for measurements that were done on a grid), we used `quantify_core.data.handling.to_gridded_dataset(dataset, dimension="dim_0", coordinates=["x0", "x1", etc.])`. After this conversion you will notice that `x0`, `x1`, etc. are promoted to coordinates indexing a dimension of the same name, and they show up in bold font.
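As a sketch of what this conversion amounts to, using plain xarray `set_index` + `unstack` on a hypothetical two-settable sweep (the actual `to_gridded_dataset` utility may do more than this):

```python
import numpy as np
import xarray as xr

# Hypothetical flat dataset: x0/x1 are coordinates on dim_0, but dim_0
# itself is an anonymous dimension that is not indexed by them.
x0 = np.repeat(np.linspace(0, 1, 3), 2)  # outer settable, 3 values
x1 = np.tile(np.linspace(-1, 1, 2), 3)   # inner settable, 2 values
dataset = xr.Dataset(
    data_vars={"y0": ("dim_0", x0 + x1)},
    coords={"x0": ("dim_0", x0), "x1": ("dim_0", x1)},
)

# Rough equivalent of to_gridded_dataset(dataset, dimension="dim_0",
# coordinates=["x0", "x1"]): promote x0/x1 to indexing coordinates.
gridded = dataset.set_index(dim_0=["x0", "x1"]).unstack("dim_0")

assert gridded.y0.dims == ("x0", "x1")
assert gridded.y0.shape == (3, 2)
```

After the conversion `y0` is indexed by the dimensions `x0` and `x1`, and `dim_0` is gone.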
Quantify dataset proposal
Besides the above, below follows a more detailed description of the proposed definition of the quantify dataset.
xarray dimensions
- [Required] `replication`
  - Intuition for this xarray dimension: the naive way of storing would be to have `dataset_repetition_0.hdf5`, `dataset_repetition_1.hdf5`, etc., where each dataset was obtained from repeating exactly the same experiment.
  - Naming rationale:
    - Replaces `repetition` with a more precise name which is indicative of its intended usage. [From dictionary: "replication = the repetition of a scientific experiment or trial to obtain a consistent result".]
    - Does not clash with `schedule.repetitions` nor any kind of hardware repetitions.
  - Default behavior of plotting tools will be to average the dataset along this dimension.
  - [Optional] `replication` can be indexed by an optional xarray coordinate variable.
    - The variable must be named `replication`.
- [Required] No other outer xarray dimensions are allowed.
  - `replication` will be the outermost xarray dimension allowed in the dataset!
  - Rationale: the plotting and analysis toolboxes need to rely on some assumptions about the dataset. Arguably, the "top level" of the dataset is critical for this, despite xarray's flexibility in selecting data.
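A minimal sketch (hypothetical sizes) of `replication` as the outermost dimension, and of the default averaging a plotting tool would apply:

```python
import numpy as np
import xarray as xr

rng = np.random.default_rng(seed=0)

# Hypothetical dataset: 5 replications of the same 4-point experiment,
# with `replication` as the outermost dimension.
dataset = xr.Dataset(
    {"y0": (("replication", "acq_set_0"),
            rng.normal(loc=1.0, scale=0.1, size=(5, 4)))}
)

# Default behavior of the plotting tools: average along `replication`.
averaged = dataset.mean(dim="replication")

assert "replication" not in averaged.dims
assert averaged.y0.shape == (4,)
```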
- [Required] `acq_set_0`
  - Note: intended to replace `dim_0`.
  - Naming rationale:
    - Have a more meaningful name for this xarray dimension in the context of measurements.
    - Getting ready for non-standard experiment flows (see next point).
    - "Acquisition set" gives some intuition that things are acquired together/in parallel.
- [Optional] `acq_set_{i}`, where `i > 0` is an integer.
  - Reserves the possibility to store data for experiments that we have not yet encountered ourselves. I have a gut feeling that we need this, but might not have a good realistic example; some help here is welcome.
  - (Example?) Imagine measuring some qubits until all of them are in a desired state, returning the data of those measurements, and then proceeding to the "real" experiment you are interested in. I think having these extra independent xarray dimensions would allow storing such data in the same dataset.
- [Required] All `acq_set_{i}` dimensions (including `acq_set_0`) are mutually exclusive. This means variables in the dataset cannot depend on more than one of these dimensions.
  - Bad variable: `y0(replication, acq_set_0, acq_set_1)`; this should never happen in the dataset.
  - Good variables: `y0(replication, acq_set_0)` or `y1(replication, acq_set_1)`.
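A quick illustration of the mutual-exclusion rule with hypothetical shapes:

```python
import numpy as np
import xarray as xr

# Hypothetical dataset: y0 lies along acq_set_0, y1 along acq_set_1;
# no variable depends on both dimensions (they are mutually exclusive).
dataset = xr.Dataset(
    {
        "y0": (("replication", "acq_set_0"), np.zeros((2, 3))),
        "y1": (("replication", "acq_set_1"), np.zeros((2, 5))),
    }
)

assert dataset.y0.dims == ("replication", "acq_set_0")
assert dataset.y1.dims == ("replication", "acq_set_1")
# A variable like y0(replication, acq_set_0, acq_set_1) would violate the rule.
```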
- [Optional, Advanced] Other (nested) xarray dimensions under each `acq_set_{i}`.
  - Intuition: intended primarily for time series, a.k.a. "time traces".
  - Other, potentially arbitrarily nested, xarray dimensions under each `acq_set_{i}` are allowed. What this means is that each entry in, e.g., a `y3` xarray variable can be a 1D or nD array where each "D" has a corresponding xarray dimension.
  - Such xarray dimensions can be named arbitrarily.
  - Each such xarray dimension can be indexed by an xarray coordinate variable. E.g., for a time trace we would have in the dataset:

    ```python
    assert "time" in dataset.coords
    assert "time" in dataset.dims
    assert len(dataset.time) == len(dataset.y3.isel(replication=0, acq_set_0=0))
    ```

    where `y3` is a measured variable storing traces.
  - Note: (just a reiteration of what an `int`/`float`/`complex` numpy `ndarray` is) when nesting data like this, it is required to have "hyper-cubic"-shaped data, meaning that e.g. `dataset.y3.isel(replication=0, acq_set_0=0) == [[2], [5, 6]]` is not possible, but `dataset.y3.isel(replication=0, acq_set_0=0) == [[2, 3], [5, 6]]` is.
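Putting the time-trace case together (hypothetical sizes and sampling rate), the assertions above hold:

```python
import numpy as np
import xarray as xr

num_reps, num_points, num_samples = 2, 3, 100
time = np.arange(num_samples) / 1e9  # hypothetical 1 GSa/s sampling

# y3 stores a trace per (replication, acq_set_0) entry, nested along `time`.
dataset = xr.Dataset(
    data_vars={
        "y3": (("replication", "acq_set_0", "time"),
               np.zeros((num_reps, num_points, num_samples))),
    },
    coords={"time": ("time", time)},
)

assert "time" in dataset.coords
assert "time" in dataset.dims
assert len(dataset.time) == len(dataset.y3.isel(replication=0, acq_set_0=0))
```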
xarray coordinate variables
Only the following xarray coordinates are allowed in the dataset:
- [Optional] Variables that are "physical coordinates", usually equivalent to settables; usually a parameter that an experimentalist "sweeps" in order to observe the effect on some other physical property of a system.
  - Should be referred to as "physical coordinates" or "experiment coordinates". These are the coordinates that index the "experiment variables". This indexing can be made explicit in xarray with `quantify_core.data.handling.to_gridded_dataset()` (when this makes sense).
  - Naming: `x{i}` where `i >= 0` is an integer.
  - Note on [Optional]: it should not be required to have at least one `x0` in order to perform the measurement. For some experiments it might not be suitable to think of something being swept.
  - [Required] Each `x{i}` must "lie" along one (and only one) `acq_set_{i}` xarray dimension.
- [Optional] Coordinates used to index dimensions.
  - Naming: `replication`, or `acq_set_{i}`, or arbitrary but with the same name as one of the nested dimensions (mentioned above).
  - [Required] Must "lie" along that (and only that) corresponding xarray dimension.
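A minimal sketch of an experiment coordinate `x0` lying along exactly one `acq_set_{i}` dimension (hypothetical amplitude sweep):

```python
import numpy as np
import xarray as xr

amplitudes = np.linspace(0, 1, 4)  # hypothetical settable values

# x0 is a (non-indexing) coordinate that lies along acq_set_0 only.
dataset = xr.Dataset(
    data_vars={"y0": (("replication", "acq_set_0"), np.zeros((2, 4)))},
    coords={"x0": ("acq_set_0", amplitudes)},
)

assert dataset.x0.dims == ("acq_set_0",)  # lies along one acq_set dimension
```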
xarray data variables
These xarray data variables are "experiment variables", usually indexed by the "experiment coordinates" mentioned above. Each entry in one of these "experiment variables" is a data point in the broad sense, i.e. it can be an `int`/`float`/`complex` OR a nested `numpy.ndarray` (of one of these dtypes).
All the xarray data variables in the dataset (that are not xarray coordinates) comply with:
- Naming:
  - `y{i}` where `i >= 0` is an integer; OR
  - `y{i}_<arbitrary>` where `i >= 0` is an integer such that it matches an existing `y{i}` in the same dataset.
    - This is intended to denote a meaningful connection between `y{i}` and `y{i}_<arbitrary>`.
    - E.g., the digitized time traces stored in `y0_trace(replication, acq_set_0, time)` and the demodulated values `y0(replication, acq_set_0)` represent the same measurement in different ways.
  - Rationale: facilitates inspecting and processing the dataset in an intuitive way.
- [Required] "Lie" along at least the `replication` and `acq_set_{i}` dimensions.
  - Good example: `y0(replication, acq_set_0)`.
  - Bad examples: `y1(replication)` or `y2(acq_set_0)`.
- [Optional] "Lie" along additional nested xarray dimensions.
At least for now, in the interest of time and scope reduction, there should not be any other xarray variables in the dataset that do not comply with these requirements.
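The `y0`/`y0_trace` pairing described above could look like this (hypothetical shapes):

```python
import numpy as np
import xarray as xr

# Hypothetical pairing: y0 holds demodulated values, y0_trace the raw
# digitized traces of the same measurement.
dataset = xr.Dataset(
    {
        "y0": (("replication", "acq_set_0"),
               np.zeros((2, 3), dtype=complex)),
        "y0_trace": (("replication", "acq_set_0", "time"),
                     np.zeros((2, 3, 100))),
    }
)

# Both lie along replication and acq_set_0, as required; y0_trace
# additionally lies along the nested `time` dimension.
assert dataset.y0.dims == ("replication", "acq_set_0")
assert dataset.y0_trace.dims == ("replication", "acq_set_0", "time")
```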
Calibration points: a caveat to all of the above
I believe that the most elegant way of dealing with calibration points is as follows:
- An xarray dimension named `acq_set_{i}_calib`.
- xarray data variables named `y{j}_calib`, which must "lie" along `acq_set_{i}_calib`, i.e. `y{j}_calib(replication, acq_set_{i}_calib, ...)`. Note that we would have `y{j}(replication, acq_set_{i}, ...)`.
- NB: `y{i}_<arbitrary>_calib` is also valid.
Some pros:
- minimize manual indexing and clumsy data processing,
- intuitive dataset,
- the naming convention would allow the plotting and analysis utilities to recognize the presence of such information from the variable name.
Some cons:
- When processing data, some code duplication might arise, or at least a `for` loop will be required if it is desired to apply the same data processing to both the data and the calibration points. I would say a reasonable price to pay.
As the title suggests, this is a bit of an exception to all that has been preached so far, but I trust you are able to understand what kind of changes are required in the overall specification of the quantify dataset to account for the "special" case of "calibration variables". PS: let's call them that: "calibration variables".
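A sketch of how calibration variables would sit next to the regular ones, including the `for` loop mentioned under the cons (the processing step is a placeholder):

```python
import numpy as np
import xarray as xr

# Hypothetical dataset with 50 data points and 4 calibration points.
dataset = xr.Dataset(
    {
        "y0": (("replication", "acq_set_0"), np.ones((2, 50))),
        "y0_calib": (("replication", "acq_set_0_calib"), np.ones((2, 4))),
    }
)

# Apply the same processing to data and calibration points via a loop:
for name in ("y0", "y0_calib"):
    dataset[name] = dataset[name] * 2  # placeholder processing step

assert dataset.y0_calib.dims == ("replication", "acq_set_0_calib")
assert float(dataset.y0.isel(replication=0, acq_set_0=0)) == 2.0
```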
Dataset attributes
- [Required] `"quantify_dataset_version": "v1"`
  - Just in case we break it a few times, at least the code will be able to do sanity checks.
- ... to be written ...
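For the version attribute, a loader-side sanity check could look like this (`check_dataset_version` and `SUPPORTED_VERSIONS` are hypothetical names, not part of quantify):

```python
import xarray as xr

SUPPORTED_VERSIONS = {"v1"}  # hypothetical set of supported versions

def check_dataset_version(dataset: xr.Dataset) -> str:
    """Return the dataset version, raising if it is missing or unsupported."""
    version = dataset.attrs.get("quantify_dataset_version")
    if version not in SUPPORTED_VERSIONS:
        raise ValueError(f"Unsupported quantify dataset version: {version!r}")
    return version

dataset = xr.Dataset(attrs={"quantify_dataset_version": "v1"})
assert check_dataset_version(dataset) == "v1"
```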
Variable attributes
- ... to be written ...
Final note
Dataset and variable attributes will be similar to the current ones. I had no time to iterate on those; the rest is more important at the moment.
@AdriaanRol here we go, we can now nerd and nitpick about this stuff.