Is there a way to allow easy metadata harvesting and data publication?
Considering the idea of implementing an OAI-Connector for CaosDB (https://www.openarchives.org; e.g. PANGAEA or https://marine-data.org) triggered the following train of thought:
One difference between CaosDB and data repositories is the following: in a data repository, you upload a complete set. The extent of the set may vary, but the repository requires essential metadata to be present. This means that all those complete sets can be harvested. In CaosDB the same data may be present, but it is less clear where the "edge" of the complete set is. A single file recorded in an experiment will not have the required metadata; where the metadata is located depends on the specific use case. In order to allow metadata harvesting, I guess it is necessary to group several Records. We also touched on this in other discussions: preparation for publication in data repositories needs it, and one understanding of FAIRDOs might require it. Maybe it is a good idea to define this at the level of the data model: some substructure is publishable once all required information exists. This structure could be referenced as a FAIRDO, its metadata could be harvested, and it could be published and distributed.
One possibility would be to define a sparse structure, something like a subset of the data model that defines the required data elements. Multiple such sparse structures could exist to define what is necessary to publish a simulation, an experiment result, etc.
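To make the idea of a sparse structure more concrete, here is a minimal sketch in plain Python. It does not use the CaosDB client; the `Record` class, the `PUBLICATION_PROFILES` mapping, the function `missing_for_publication`, and all RecordType and Property names are hypothetical and only serve as an illustration of what such a separate definition could look like.

```python
from dataclasses import dataclass, field


@dataclass
class Record:
    """Hypothetical stand-in for a Record: a type name plus property values."""
    record_type: str
    properties: dict = field(default_factory=dict)


# Hypothetical sparse structures: for each publishable kind of data set,
# the RecordTypes that must be present and the Properties each one needs.
PUBLICATION_PROFILES = {
    "Simulation": {
        "Simulation": ["author", "date", "software"],
        "Software": ["name", "version"],
    },
}


def missing_for_publication(profile, records):
    """List (record_type, property_name) pairs that still block publication.

    A pair with property_name None means that no Record of that type exists.
    """
    missing = []
    for record_type, names in PUBLICATION_PROFILES[profile].items():
        matching = [r for r in records if r.record_type == record_type]
        if not matching:
            missing.append((record_type, None))
            continue
        for rec in matching:
            for name in names:
                if rec.properties.get(name) is None:
                    missing.append((record_type, name))
    return missing
```

A group of Records would be publishable under a given profile once this function returns an empty list.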
Unfortunately, I do not think that it would be a good idea to use "Obligatory Properties" for this, since this would require a researcher to add all information right away and might raise the barrier to using the system.
"Recommended Properties" could be used. This would mean that the "edge" of the complete set is defined with this importance automatically. This would add another meaning to the Property Importances:
- Obligatory: Record cannot be created without this Property
- Recommended: Record cannot be published without this Property
- Suggested: Everything else
In order to get the complete data set of e.g. a simulation, you would start at the simulation Record and traverse the graph using obligatory or recommended Properties.
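The traversal described above could look roughly like the following sketch. It is plain Python and not the CaosDB API: the `Property` and `Record` classes, the importance constants, and `collect_publication_set` are assumptions made for illustration. References are followed only through obligatory or recommended Properties, and any such Property without a value is reported as blocking publication.

```python
from dataclasses import dataclass, field

# Hypothetical importance levels, named after the list above.
OBLIGATORY, RECOMMENDED, SUGGESTED = "obligatory", "recommended", "suggested"


@dataclass
class Property:
    name: str
    importance: str
    value: object = None  # a plain value, a referenced Record, or None


@dataclass
class Record:
    name: str
    properties: list = field(default_factory=list)


def collect_publication_set(start):
    """Collect the Records reachable from `start` via obligatory or
    recommended Properties, plus the (record, property) pairs lacking a value."""
    seen = {}      # id(record) -> record, in order of discovery
    missing = []
    stack = [start]
    while stack:
        rec = stack.pop()
        if id(rec) in seen:
            continue  # already visited; also protects against cycles
        seen[id(rec)] = rec
        for prop in rec.properties:
            if prop.importance not in (OBLIGATORY, RECOMMENDED):
                continue  # suggested Properties lie outside the "edge"
            if prop.value is None:
                missing.append((rec.name, prop.name))
            elif isinstance(prop.value, Record):
                stack.append(prop.value)
    return list(seen.values()), missing
```

Starting at a simulation Record, the first return value would be the candidate "complete set", and an empty second return value would mean the set is ready for harvesting or publication.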
Advantage: No separate definition necessary. Disadvantage: Another aspect (publication) needs to be considered when designing the data model. Changing the data model might violate publication requirements.