New trajectory file format (HDF5/H5MD?)
Summary
There is a need/request for a more modern trajectory file format in GROMACS. This will be developed as part of the MDDB project.
The XTC format in GROMACS has been around for approximately 30 years. It is the de facto standard for lossy compression of coordinate trajectories.
TNG was developed slightly more than ten years ago to provide a replacement. It also has efficient lossy compression, including temporal (multi-frame) compression. It is an extensible format, able to store the molecular topology (atom information, connectivity etc.), which makes it self-contained. However, the format did not catch on (for example, I/O plugins were not included in VMD distributions), and supporting and further developing a dedicated library and API took an unnecessarily large amount of work.
The plan is to switch to a new trajectory file format, but to use existing standards as far as possible. The file format should be modern and self-documenting. There must be compression (lossy and lossless) that is as efficient as XTC and TNG.
I/O and Feature Requirements
- Writing single frames should not be significantly slower than writing XTC/TNG. Simulation I/O should account for less than 1% of the run time for most simulations.
- Support random access to frames without reading all intermediate data.
- Support exact restarts after checkpointing.
- Store a simple molecular system description, enough to open in visualization programs and/or run analyses.
- It should be extensible. It should be possible to store, e.g., energies and provenance records.
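Random access and appending after a checkpoint restart both come down to locating frames by MD step. The following is a minimal sketch, assuming the per-frame step numbers have already been read into a sorted in-memory array (H5MD keeps them in a "step" dataset next to each "value" dataset). The function names findFrameIndex and framesToKeep are hypothetical and are not part of GROMACS or HDF5.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Return the index of the frame written at exactly `step`, or -1 if no
// such frame exists. Assumes `steps` is sorted in increasing order.
std::ptrdiff_t findFrameIndex(const std::vector<std::int64_t>& steps, std::int64_t step)
{
    assert(std::is_sorted(steps.begin(), steps.end()));
    const auto it = std::lower_bound(steps.begin(), steps.end(), step);
    if (it == steps.end() || *it != step)
    {
        return -1;
    }
    return it - steps.begin();
}

// Number of leading frames to keep when appending after a restart from a
// checkpoint written at `checkpointStep`. Frames written after that step
// would be overwritten by the continued run.
std::size_t framesToKeep(const std::vector<std::int64_t>& steps, std::int64_t checkpointStep)
{
    assert(std::is_sorted(steps.begin(), steps.end()));
    const auto it = std::upper_bound(steps.begin(), steps.end(), checkpointStep);
    return static_cast<std::size_t>(it - steps.begin());
}
```

With steps {0, 100, 200, 300}, a lookup of step 200 gives frame index 2, and a restart from a checkpoint at step 250 keeps the first three frames. Only the small step array has to be read to find a frame. The coordinate data of intermediate frames is never touched.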
Compression Requirements
- Must match the XTC compression level.
- Compression levels must be tunable for accuracy.
- Can guarantee accuracy, specified as absolute precision, with relative precision as a possible alternative if chosen by the user.
- Can store lossily compressed data along with uncompressed (or losslessly compressed) data in the same file. E.g., store lossy coordinates and lossless velocities and forces in the same file. Preferably both lossless and lossy coordinates in parallel.
- Multiframe compression. This does not improve compression much if the frames are sparsely stored, but in cases where data is written frequently it can have a significant impact, especially for parts of the system with slow movements.
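To make the accuracy guarantee and the multiframe idea concrete, here is a minimal sketch of the underlying arithmetic. It is not the SZ3 or XTC algorithm. It only shows (a) quantization onto a grid of spacing 2 * absTol, which bounds the reconstruction error by absTol, and (b) storing a frame as integer differences from the previous frame. All function names are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Quantize values onto a grid with spacing 2 * absTol. Rounding to the
// nearest grid point keeps the reconstruction error within absTol
// (up to floating-point rounding). No overflow check: sketch only.
std::vector<std::int32_t> quantize(const std::vector<float>& values, double absTol)
{
    assert(absTol > 0.0);
    std::vector<std::int32_t> q;
    q.reserve(values.size());
    for (const float v : values)
    {
        q.push_back(static_cast<std::int32_t>(std::llround(v / (2.0 * absTol))));
    }
    return q;
}

// Map grid indices back to real values.
std::vector<double> dequantize(const std::vector<std::int32_t>& q, double absTol)
{
    std::vector<double> values;
    values.reserve(q.size());
    for (const std::int32_t i : q)
    {
        values.push_back(i * 2.0 * absTol);
    }
    return values;
}

// Inter-frame delta: store the difference to the previous quantized frame.
// Slowly moving particles give small deltas, which an entropy coder
// can store in fewer bits.
std::vector<std::int32_t> deltaEncode(const std::vector<std::int32_t>& previous,
                                      const std::vector<std::int32_t>& current)
{
    assert(previous.size() == current.size());
    std::vector<std::int32_t> delta(current.size());
    for (std::size_t i = 0; i < current.size(); ++i)
    {
        delta[i] = current[i] - previous[i];
    }
    return delta;
}

// Inverse of deltaEncode: rebuild the current frame from the previous one.
std::vector<std::int32_t> deltaDecode(const std::vector<std::int32_t>& previous,
                                      const std::vector<std::int32_t>& delta)
{
    assert(previous.size() == delta.size());
    std::vector<std::int32_t> current(delta.size());
    for (std::size_t i = 0; i < delta.size(); ++i)
    {
        current[i] = previous[i] + delta[i];
    }
    return current;
}
```

The lossy step happens only once, in quantize. The delta encoding operates on integers and is exactly reversible, so the error bound of a frame does not depend on how many frames precede it. When frames are stored sparsely the deltas are as large as the values themselves, which is why multiframe compression gains little in that case.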
Library and Code Requirements
- A library that can be called from C++, C, Fortran and Python with well-documented APIs.
- Fully portable to all available operating systems.
- Freely available and reusable under BSD (or similar) license.
- In order to improve compatibility across codes and reduce maintenance, rely on existing standards, where possible.
- If an external library is very large, investigate if a stripped-down version can be included in GROMACS.
Status after investigation
- HDF5 is a widely used file format.
- H5MD (https://www.nongnu.org/h5md/ and https://www.sciencedirect.com/science/article/pii/S0010465514000447) is a specification for storing molecular data.
- MDAnalysis can handle H5MD i/o: https://docs.mdanalysis.org/2.0.0/documentation_pages/coordinates/H5MD.html.
- There is a VMD plugin for H5MD files: https://github.com/h5md/VMD-h5mdplugin, but it might need some improvements.
- HDF5 supports compression filter plugins.
- We have started a collaboration with the SZ3 compression (https://github.com/szcompressor/SZ3) developers in the FZ compression framework (https://szcompressor.org/next.szcompressor.github.io/).
- They are helping us improve their SZ3/MDZ compression filter plugin to work better with biophysical (water-rich) data.
There is a branch available with prototype code for H5MD i/o in GROMACS (ml_h5md-prototype). Warning: this is work in progress. The branch will be split into smaller MRs, each implementing an isolated feature.
This is an umbrella issue covering the development of H5MD i/o.
It may still turn out that HDF5/H5MD does not fulfill our requirements and needs and that we will have to find another alternative. But so far, we are working with the aim of using H5MD as a new trajectory format.
Feature checklist in prototype branch
- Handle HDF5 library in CMake.
- Write and read box, coordinates, velocities and forces to "/particles/<group>/box", "/particles/<group>/position", "/particles/<group>/velocity" and "/particles/<group>/force" groups.
- Lossless compression
- Lossy compression of coordinates (SZ3 bio compression (WIP) and SZ3 XTC implementation)
- Lossy compression of velocities and forces
- Write separate parts of the system as separate particles groups to enable different compression settings for different parts of the system.
- Write masses and charges to "/particles/<group>/mass" and "/particles/<group>/charge" groups.
- Write connectivity records to "/connectivity"
- Write energy data to "/observables"
- Write a custom GROMACS system topology to "/parameters"
- Write a VMD compatible topology to "/parameters/vmd_structure" (currently not writing full set of records)
- Continue from checkpoint, appending to previous file
- Make backups of existing files if creating a new file with the same name
- Write provenance records
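The "/particles/<group>/..." paths in the checklist follow the H5MD layout, where each time-dependent element is itself a group holding "step", "time" and "value" datasets. A minimal sketch of how such paths could be assembled is shown below. The helper names and the group name "system" are hypothetical and are not taken from the prototype branch.

```cpp
#include <cassert>
#include <string>

// Path of the H5MD group holding one per-particle element,
// e.g. "/particles/system/position".
std::string particlesElementPath(const std::string& group, const std::string& element)
{
    assert(!group.empty() && !element.empty());
    return "/particles/" + group + "/" + element;
}

// Path of one dataset inside a time-dependent H5MD element.
// H5MD defines the dataset names "step", "time" and "value".
std::string timeDependentDatasetPath(const std::string& elementPath, const std::string& dataset)
{
    assert(dataset == "step" || dataset == "time" || dataset == "value");
    return elementPath + "/" + dataset;
}
```

For example, timeDependentDatasetPath(particlesElementPath("system", "velocity"), "value") yields "/particles/system/velocity/value". Writing separate parts of the system with different compression settings then amounts to using a different <group> name for each part.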
Feature checklist in main
- Handle HDF5 library in CMake. (!4372 (merged))
- Write and read box, coordinates, velocities and forces to "/particles/<group>/box", "/particles/<group>/position", "/particles/<group>/velocity" and "/particles/<group>/force" groups.
- Lossless compression
- Lossy compression of coordinates (SZ3 bio compression (WIP) and SZ3 XTC implementation)
- Lossy compression of velocities and forces
- Write separate parts of the system as separate particles groups to enable different compression settings for different parts of the system.
- Write masses and charges to "/particles/<group>/mass" and "/particles/<group>/charge" groups.
- Write connectivity records to "/connectivity"
- Write energy data to "/observables"
- Write a custom GROMACS system topology to "/parameters"
- Write a VMD compatible topology to "/parameters/vmd_structure"
- Write provenance records
Important things to consider/keep in mind
- Should there be two separate classes, one for reading and one for writing? See !4372 (comment 1972945129).
HDF5 version compatibility
We are no longer using target_compile_options(fileio INTERFACE "-DH5_USE_110_API") to enforce the use of v1.10-compatible macros in the API. To handle API version incompatibilities, the affected functions need to be handled individually in the source code. In the prototype version of the code, there is currently one function that needs special version treatment:
# if H5_VERS_MINOR < 12
H5Oget_info_by_name(locationId, name, &infoBuffer, H5P_DEFAULT);
# else
H5Oget_info_by_name(locationId, name, &infoBuffer, H5O_INFO_BASIC, H5P_DEFAULT);
# endif
See https://docs.hdfgroup.org/hdf5/v1_14/group___h5_o.html#ga96ce408ffda805210844246904da2842, https://docs.hdfgroup.org/hdf5/v1_14/group___h5_o.html#ga0090da86c086c1c63a5acfaed39a035e and https://docs.hdfgroup.org/hdf5/v1_14/group___h5_o.html#gabb69c962999e027cef0079bbb1282199 for more information.
The prototype version of the H5MD implementation in GROMACS is tested using HDF5 versions 1.10 and 1.14.