Extract data + metadata for the entire PDB
We need a pipeline that processes the PDB and extracts useful data + metadata:
- IntRAchain and intERchain interactions
- Interactions with small molecules
We should probably use mmCIF files for this (see #13 and #14).
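
To make the interaction categories above concrete, here is a minimal sketch of how atom pairs could be sorted into intrachain, interchain, and small-molecule contacts. Everything in it is an assumption for illustration: the `Atom` record, the `classify_contacts` name, the `is_ligand` flag, and the 5 Å distance cutoff are hypothetical, and a real pipeline would fill the atom records from mmCIF files and use a spatial index instead of an all-pairs loop.

```python
from collections import namedtuple
from itertools import combinations
from math import dist

# Hypothetical minimal atom record; a real pipeline would fill it from mmCIF.
Atom = namedtuple("Atom", "chain residue is_ligand xyz")


def classify_contacts(atoms, cutoff=5.0):
    """Group atom pairs closer than `cutoff` (in angstroms, assumed) by kind."""
    contacts = {"intrachain": [], "interchain": [], "ligand": []}
    for a, b in combinations(atoms, 2):
        # Ignore pairs of atoms that belong to the same residue.
        if (a.chain, a.residue) == (b.chain, b.residue):
            continue
        if dist(a.xyz, b.xyz) > cutoff:
            continue
        if a.is_ligand or b.is_ligand:
            kind = "ligand"
        elif a.chain == b.chain:
            kind = "intrachain"
        else:
            kind = "interchain"
        contacts[kind].append((a, b))
    return contacts
```

The all-pairs loop is quadratic in the number of atoms, so it is only meant to pin down the definitions, not to scale to the entire PDB.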
Notes
- Processing a non-trivial number of PDB files takes a long time. We could convert all PDB files into a faster binary format such as HDF5, but that would make it harder to distribute our code to others.
- The pipeline should be reasonably easy to run on new PDB entries, so that we can always fetch a structure from the RCSB website if it is not available locally.
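
The fetch-if-missing behaviour described in the last note could look roughly like the sketch below. It is not a settled design: the function name `get_structure_file`, the `pdb_cache` directory, and the injectable `download` argument are hypothetical, and the URL pattern is the public RCSB file download endpoint as I understand it, so it should be checked against the RCSB documentation before we rely on it.

```python
from pathlib import Path
from urllib.request import urlretrieve

# Assumed RCSB download URL pattern for mmCIF files.
RCSB_URL = "https://files.rcsb.org/download/{pdb_id}.cif"


def get_structure_file(pdb_id, cache_dir="pdb_cache", download=urlretrieve):
    """Return the local path of an mmCIF file, downloading it only if missing."""
    pdb_id = pdb_id.lower()
    cache_dir = Path(cache_dir)
    path = cache_dir / f"{pdb_id}.cif"
    if not path.exists():
        cache_dir.mkdir(parents=True, exist_ok=True)
        # Download to a temporary name first so that an interrupted
        # transfer never leaves a truncated file under the final name.
        tmp = path.with_suffix(".cif.part")
        download(RCSB_URL.format(pdb_id=pdb_id), str(tmp))
        tmp.replace(path)
    return path
```

Calling `get_structure_file("4hhb")` would download the file on first use and reuse the cached copy afterwards. Passing the downloader as an argument keeps the function testable without network access.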