Curation service
Attention: this is an incomplete proposal draft!
Description
- Parsing: output from multifetch is pushed to the mongodb `RAW` collection (without performing templating). Records that failed parsing have to be fixed manually by a curator, then pushed again.
- Reduce curation: mapping extra attributes and coarse curation
  - Curator selects a set of records to curate (typically `wip version`, `all`, `attrX==Y`, `invalid`).
  - Selected records appear as a record list with colors: green (attribute valid) or red (attribute invalid).
  - Curator selects an attribute to curate and gets the list of all values taken by this attribute in the selection.
  - Curator can then select one value and replace it with an existing value or a manual one. It is replaced in all the records having this value.
- Fine curation: same as coarse curation except that:
  - Extra attributes are not available anymore
  - Older records can be modified
- Each time a curation is done, the involved records are pushed to the database even if the validation fails.
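
The coarse curation step (list the values an attribute takes in the selection, then replace one value everywhere) could look roughly like the sketch below. It works on plain in-memory dicts rather than mongodb, and the function names and the `species` field are hypothetical, chosen only for illustration.

```python
from collections import Counter


def attribute_values(records, attribute):
    """Count each distinct value the attribute takes in the selection.

    Assumes attribute values are hashable (strings, numbers, ...).
    """
    return Counter(r[attribute] for r in records if attribute in r)


def replace_value(records, attribute, old, new):
    """Replace `old` with `new` in every selected record having that value.

    Returns the number of records that were changed.
    """
    changed = 0
    for record in records:
        if attribute in record and record[attribute] == old:
            record[attribute] = new
            changed += 1
    return changed


# Hypothetical selection of records sharing a "species" attribute.
selection = [
    {"id": 1, "species": "H. sapiens"},
    {"id": 2, "species": "Homo sapiens"},
    {"id": 3, "species": "H. sapiens"},
]

counts = attribute_values(selection, "species")
changed = replace_value(selection, "species", "H. sapiens", "Homo sapiens")
```

Returning the count of changed records gives the curator a quick check that the replacement touched as many records as the value list suggested.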
Remarks:
- If two curators curate the same set of records at the same time, the one who pushes last overwrites the changes made by the one who pushed first. This is acceptable at first.
- The system is meant to be stateless: it doesn't need to maintain states (`new`, `invalid`, `curated`, `committed`, ...). ISSUE: how to deal with the fact that some records may be in the raw state (because expression failed), others in the expressed 1 state (because templating failed), others in the template state (because expression 2 failed), ... The only condition for a new release to be performed is that it contains only valid records.
- Validation = expression + templating + JSON validation
- Values detected as templates should have a special color and, on hover, show the result of the template resolution.
- Fields that fail validation should be red and, on hover, show the error message. Template fields should discard any validation error.
- Evidences are separated from values, but otherwise we still use the tsv format (for now).
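
One possible way to stay stateless is to recompute a record's status on demand by running the validation stages in order and reporting the first one that fails. The sketch below assumes that ordering. All three stage functions are stand-ins, not the real expression, templating, or JSON validation steps, and the `{field}` template syntax and the required `name` field are invented for the example.

```python
def express(record):
    # Stand-in for the real expression step, which this sketch does not model.
    return dict(record)


def apply_templates(record):
    # Stand-in templating: "{other_field}" placeholders are filled from the record.
    resolved = {}
    for key, value in record.items():
        if isinstance(value, str):
            try:
                value = value.format_map(record)
            except (KeyError, IndexError, ValueError) as exc:
                raise ValueError(f"cannot resolve template in {key!r}: {exc}") from exc
        resolved[key] = value
    return resolved


def check_schema(record):
    # Stand-in for JSON validation: only checks one hypothetical required field.
    if not isinstance(record.get("name"), str):
        raise ValueError("'name' must be a string")
    return record


STAGES = [
    ("expression", express),
    ("templating", apply_templates),
    ("json_validation", check_schema),
]


def validate(record, stages=STAGES):
    """Run the stages in order and report the first failure, if any."""
    current = record
    for name, stage in stages:
        try:
            current = stage(current)
        except ValueError as exc:
            return {"valid": False, "failed_stage": name, "error": str(exc)}
    return {"valid": True, "failed_stage": None, "error": None}


def releasable(records, stages=STAGES):
    """A release is allowed only if every record passes every stage."""
    return all(validate(r, stages)["valid"] for r in records)


good = {"id": 1, "name": "sample-{id}"}
bad_template = {"id": 2, "name": "sample-{batch}"}
bad_schema = {"id": 3}
```

With this shape, the failed stage and error message are derived whenever the record list is displayed, so nothing like `invalid` or `curated` has to be stored with the record.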
Should we work with un-templated data? In this case we should work with a `wip` database because the real db is supposed to contain only templated & unexpressed & valid records. So if we work with un-templated and potentially failed data, they should maybe be stored in `wip`. In this case, how to update existing records? Copy them to `wip`? (It would make sense since we do not want to update released records!!).
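
If the copy-to-`wip` option were chosen, editing an existing record could follow a copy-on-edit pattern. This is only a sketch of that option: plain dicts stand in for the released and `wip` collections, and the `checkout` helper and record contents are hypothetical.

```python
import copy


def checkout(record_id, released, wip):
    """Return the editable wip copy, creating it from the released record if needed."""
    if record_id not in wip:
        # Deep copy so that nested values are not shared with the released record.
        wip[record_id] = copy.deepcopy(released[record_id])
    return wip[record_id]


released = {"r1": {"name": "alpha", "tags": ["a"]}}
wip = {}

draft = checkout("r1", released, wip)
draft["name"] = "beta"
draft["tags"].append("b")
```

The released record stays untouched while the curator works on the copy. A second `checkout` of the same id returns the existing draft instead of overwriting it.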
Edited by mma227