Configurable annotation
Description
Support import and display of configurable annotation
For display, see !522 (merged)
Import of annotation is now determined by the latest inserted row in the new annotationconfig table. The migration script creates this table and populates it with an import config that reflects the current annotation import.
The file annotation-config.yml is used in the testdata, and can serve as an example of how to create an annotation config.
Available generic annotation converters:
There are currently four available generic annotation converters.
- keyvalue
- json
- mapping
- meta
Of these, the json converter gives the most flexibility, and the keyvalue converter is the most transparent.
All the examples below generate the same output structure in the annotations column of the annotation table:
{
  "PATH": {
    "TO": {
      "TARGET": {
        "foo": 1,
        "bar": 2
      }
    }
  }
}
keyvalue (keyvalueconverter.py)
Reads key/value pairs from the annotation.
Example config:
- name: keyvalue
  converter_config:
    elements:
      - source: FOO
        target: PATH.TO.TARGET.foo
        target_type: int
      - source: BAR
        target: PATH.TO.TARGET.bar
        target_type: int
Example annotation value: FOO=1;BAR=2
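To make the keyvalue example concrete, here is a minimal Python sketch of how a dotted target such as PATH.TO.TARGET.foo can be expanded into the nested output structure. The helper names (set_nested, convert_keyvalue) and the set of target_type names other than int are assumptions for illustration, not the code in keyvalueconverter.py.

```python
from typing import Any, Dict, List

# Assumed mapping of target_type names; only "int" appears in the example config.
TYPES = {"int": int, "float": float, "string": str}


def set_nested(dest: Dict[str, Any], dotted_path: str, value: Any) -> None:
    """Store value in dest at a dot-separated path, creating dicts on the way."""
    *parents, leaf = dotted_path.split(".")
    node = dest
    for key in parents:
        node = node.setdefault(key, {})
    node[leaf] = value


def convert_keyvalue(info: str, elements: List[Dict[str, str]]) -> Dict[str, Any]:
    """Parse 'FOO=1;BAR=2' and place each configured source at its target."""
    raw = dict(item.split("=", 1) for item in info.split(";") if "=" in item)
    result: Dict[str, Any] = {}
    for element in elements:
        if element["source"] not in raw:
            continue  # this annotation is not present in the INFO string
        cast = TYPES[element.get("target_type", "string")]
        set_nested(result, element["target"], cast(raw[element["source"]]))
    return result


elements = [
    {"source": "FOO", "target": "PATH.TO.TARGET.foo", "target_type": "int"},
    {"source": "BAR", "target": "PATH.TO.TARGET.bar", "target_type": "int"},
]
annotations = convert_keyvalue("FOO=1;BAR=2", elements)
```

With the example config and annotation value, annotations has the same shape as the output structure shown earlier in this description.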
json (jsonconverter.py)
Reads base16/32/64-encoded JSON data and parses it.
Example config:
- name: json
  converter_config:
    elements:
      - source: MYJSON
        target: PATH.TO.TARGET
        encoding: base16
Example annotation value: MYJSON=7B22666F6F223A20312C2022626172223A20327D
Note: base64.b16encode(json.dumps({"foo": 1, "bar": 2}).encode()).decode() == '7B22666F6F223A20312C2022626172223A20327D'
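The round trip for the json example can be checked with the standard library alone. This is a small sketch; the function name decode_json_annotation and the DECODERS table are hypothetical and not taken from jsonconverter.py.

```python
import base64
import json

# Assumed mapping from the config's encoding value to a stdlib decoder.
DECODERS = {
    "base16": base64.b16decode,
    "base32": base64.b32decode,
    "base64": base64.b64decode,
}


def decode_json_annotation(value: str, encoding: str):
    """Decode an encoded annotation value and parse the result as JSON."""
    return json.loads(DECODERS[encoding](value).decode())


# Encode the payload the same way as in the note above, then decode it again.
encoded = base64.b16encode(json.dumps({"foo": 1, "bar": 2}).encode()).decode()
decoded = decode_json_annotation(encoded, "base16")

# Placing the decoded object at PATH.TO.TARGET gives the common output structure.
annotations = {"PATH": {"TO": {"TARGET": decoded}}}
```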
mapping (mappingconverter.py)
Reads key/value structures in which items are separated by one character (e.g. ,) and each key is separated from its value by another (e.g. :).
Example config:
- name: mapping
  converter_config:
    elements:
      - source: DABLA
        target: PATH.TO.TARGET
        item_separator: ',' # Default value
        keyvalue_separator: ':' # Default value
        value_target_type: int
Example annotation value: DABLA=foo:1,bar:2
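As an illustration of the two separators, the following sketch splits the example value into a dictionary. The function convert_mapping is a hypothetical stand-in, with parameter names borrowed from the config keys; it is not the implementation in mappingconverter.py.

```python
from typing import Any, Callable, Dict


def convert_mapping(
    value: str,
    item_separator: str = ",",
    keyvalue_separator: str = ":",
    value_target_type: Callable[[str], Any] = int,
) -> Dict[str, Any]:
    """Split 'foo:1,bar:2' into items, then each item into key and value."""
    result: Dict[str, Any] = {}
    for item in value.split(item_separator):
        key, raw = item.split(keyvalue_separator, 1)
        result[key] = value_target_type(raw)
    return result


parsed = convert_mapping("foo:1,bar:2")
```

Storing parsed at PATH.TO.TARGET again yields the common output structure.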
meta (metaconverter.py)
Uses meta information (the ##INFO header) to create JSON structures, where keys are fetched from the header and values from the annotation. Requires the meta information line to match a given regex pattern for extracting keys.
Example config:
- name: meta
  converter_config:
    elements:
      - source: DABLA
        target: PATH.TO.TARGET
        meta_pattern: (?i)[a-z_]+\|[a-z_\|]+ # Default: Used to fetch keys
        element_separator: "|"
        subelements:
          - source: foo
            target_type: int
          - source: bar
            target_type: int
Example header line (meta information): ##INFO=<ID=DABLA,Number=.,Type=String,Description="Format: foo|bar">
Example annotation value: DABLA=1|2
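The interplay between the header line and the annotation value can be sketched as follows. The default meta_pattern is taken from the example config; the function convert_meta, the TYPES table, and the choice to skip keys that are not listed under subelements are assumptions for illustration only.

```python
import re
from typing import Any, Dict, List

META_PATTERN = r"(?i)[a-z_]+\|[a-z_\|]+"  # default pattern from the example config
TYPES = {"int": int, "string": str}  # assumed target_type names


def convert_meta(
    header_line: str,
    value: str,
    subelements: List[Dict[str, str]],
    element_separator: str = "|",
) -> Dict[str, Any]:
    """Pair keys found in the ##INFO header with values from the annotation."""
    match = re.search(META_PATTERN, header_line)
    if match is None:
        raise ValueError("meta information line does not match meta_pattern")
    keys = match.group(0).split(element_separator)
    casts = {sub["source"]: TYPES[sub.get("target_type", "string")] for sub in subelements}
    result: Dict[str, Any] = {}
    for key, raw in zip(keys, value.split(element_separator)):
        if key in casts:  # assumption: keys without a subelement entry are skipped
            result[key] = casts[key](raw)
    return result


header = '##INFO=<ID=DABLA,Number=.,Type=String,Description="Format: foo|bar">'
subelements = [
    {"source": "foo", "target_type": "int"},
    {"source": "bar", "target_type": "int"},
]
parsed = convert_meta(header, "1|2", subelements)
```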
Available specific annotation converters:
- clinvarjson -> Convert current ClinVar data to the form expected by the database
- clinvarreferences -> Read data from the ClinVar JSON structure
- hgmdextrarefs -> Read data from HGMD__EXTRAREFS
- vep -> Read the CSQ field
- hgmd -> Read HGMD-specific fields
Ramblings
The generic annotation converters should handle most of the new annotation we could want.
However, adding new specific converters is also made a lot simpler (see for example https://gitlab.com/alleles/ella/-/blob/95026d6453a948e6fe104753bbf12a5cef217c20/src/vardb/deposit/annotationconverters/hgmd.py).
There are three possible routes:
- Add a plugin system for adding converters "on the fly".
- Create more complex generic converters able to handle more complex annotation (albeit with more complex configuration)
- Keep it "as is", and just add specific converters for any new "weird" annotation data.
Related issues
Notes to review (code/docs/QA)
Only the data in the INFO column is currently supported. In the future we should also support data from the sample-specific columns, which should be added to a new column annotations in genotypesampledata.
In this MR I've attempted to limit logic changes to the import config and the annotation importer. All converters are simply moved from their previous location, with the exception of the (now redundant) frequency conversion, which is handled by keyvalue converters. Tests have also mainly been moved. This means that code review is mainly needed on src/vardb/deposit/importers.py.
The table annotationconfig holds all configs, where the latest added row is the active one, meaning it will be used for all new annotation being imported.
In addition, the annotation table holds the id of the annotationconfig that was active at the time of import. This will help down the road, when we add annotation view config.
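The "latest row is active" rule can be illustrated with a self-contained sketch. It uses an in-memory SQLite database purely as a stand-in for the real database, and the column names (config, annotations, annotation_config_id) and helper functions are assumptions, not the actual schema or importer code.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE annotationconfig (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        config TEXT NOT NULL
    );
    CREATE TABLE annotation (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        annotations TEXT NOT NULL,
        annotation_config_id INTEGER NOT NULL REFERENCES annotationconfig(id)
    );
    """
)


def active_config_id(conn: sqlite3.Connection) -> int:
    """Return the id of the most recently inserted annotationconfig row."""
    row = conn.execute(
        "SELECT id FROM annotationconfig ORDER BY id DESC LIMIT 1"
    ).fetchone()
    return row[0]


def import_annotation(conn: sqlite3.Connection, annotations: str) -> None:
    """Store an annotation together with the config that is active right now."""
    conn.execute(
        "INSERT INTO annotation (annotations, annotation_config_id) VALUES (?, ?)",
        (annotations, active_config_id(conn)),
    )


conn.execute("INSERT INTO annotationconfig (config) VALUES (?)", ("config v1",))
import_annotation(conn, '{"foo": 1}')
conn.execute("INSERT INTO annotationconfig (config) VALUES (?)", ("config v2",))
import_annotation(conn, '{"foo": 2}')

config_ids = conn.execute(
    "SELECT annotation_config_id FROM annotation ORDER BY id"
).fetchall()
```

Each annotation row keeps pointing at the config it was imported with, even after a newer config has become active.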
TODO:
- Modify annotation JSON schema: Keep transcripts and references. Frequency? Other keys?
- Add JSON schema for import config: available converters, config values etc.
- CLI command for updating annotation config
- Typing
Suggested out of scope for this MR:
- CLI command for testing annotation config
- CLI command for creating config?
Tests
General
- Tests have been added that prove my fix is effective or that my feature works
- Related tests have been modified/removed
Hypothesis testing:
- Soak testing has been done
- Distribution between positive / negative cases has been checked
Database
- Includes changes to database schema
- Includes necessary database migrations
Configuration
- Includes changes to configuration
- Includes configuration migration instructions in documentation
Merge checklist
- Self-review of code has been performed.
- Feature review and validation against specification has been performed (if applicable). Apply label: QAdone
- Need for documentation has been evaluated and, if necessary, updated. Apply label: docsdone
- Code and implementation has been reviewed by other core developer (including any changes based on initial review). Apply label: code reviewdone
Closes #140 (closed) Closes #171 (closed)