Configurable annotation (!405) · Merge requests · ELLA / ELLA

Svein Tore Koksrud Seljebotn requested to merge 140-support-import-custom-annotation into dev Apr 24, 2020

Description

Support import and display of configurable annotation

Import of annotation is now determined by the latest inserted row in the new annotationconfig table. The migration script creates this table and populates it with an import config that reflects the current annotation import.

The file annotation-config.yml is used in the testdata, and can be used as an example of how an annotation config can be created.

Available generic annotation converters:

Four generic annotation converters are currently available:

keyvalue
json
mapping
meta

Of these, the json converter gives the most flexibility, and the keyvalue converter is the most transparent.

All the examples below produce the same output structure in the annotations column of the annotation table:

{
   "PATH": {
       "TO": {
           "TARGET": {
               "foo": 1,
               "bar": 2
           }
       }
   }
}

keyvalue (`keyvalueconverter.py`)

Reads key/value pairs from the annotation.

Example:

Config:

- name: keyvalue
  converter_config:
      elements:
          - source: FOO
            target: PATH.TO.TARGET.foo
            target_type: int
          - source: BAR
            target: PATH.TO.TARGET.bar
            target_type: int

Annotation values: FOO=1;BAR=2
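To make the keyvalue example concrete, here is a minimal Python sketch of how such a config could turn `FOO=1;BAR=2` into the nested structure shown earlier. It is not the code in `keyvalueconverter.py`: the names `set_nested`, `convert_keyvalue` and `TYPE_CASTS` are hypothetical, and the supported `target_type` values other than `int`, the `str` default, and the skipping of missing sources are assumptions.

```python
from typing import Any, Dict, List

# Assumption: only "int" appears in the example config; the other casts are illustrative.
TYPE_CASTS = {"int": int, "float": float, "str": str}


def set_nested(result: Dict[str, Any], dotted_path: str, value: Any) -> None:
    """Store value in result at the location named by a dotted path."""
    *parents, leaf = dotted_path.split(".")
    node = result
    for key in parents:
        # Create intermediate dictionaries on demand.
        node = node.setdefault(key, {})
    node[leaf] = value


def convert_keyvalue(info: str, elements: List[Dict[str, str]]) -> Dict[str, Any]:
    """Apply keyvalue-style elements to an INFO string such as 'FOO=1;BAR=2'."""
    raw = dict(item.split("=", 1) for item in info.split(";") if "=" in item)
    result: Dict[str, Any] = {}
    for element in elements:
        if element["source"] not in raw:
            continue  # Assumption: sources missing from the annotation are skipped.
        cast = TYPE_CASTS[element.get("target_type", "str")]
        set_nested(result, element["target"], cast(raw[element["source"]]))
    return result


elements = [
    {"source": "FOO", "target": "PATH.TO.TARGET.foo", "target_type": "int"},
    {"source": "BAR", "target": "PATH.TO.TARGET.bar", "target_type": "int"},
]
annotations = convert_keyvalue("FOO=1;BAR=2", elements)
print(annotations)
```

Each element maps exactly one source key to one target path, which is what makes this converter easy to read at the cost of one config entry per value.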

json (`jsonconverter.py`)

Reads base16-, base32- or base64-encoded JSON data and parses it.

Example config:

- name: json
  converter_config:
      elements:
          - source: MYJSON
            target: PATH.TO.TARGET
            encoding: base16

Example annotation value: MYJSON=7B22666F6F223A20312C2022626172223A20327D

Note: base64.b16encode(json.dumps({"foo": 1, "bar": 2}).encode()).decode() == '7B22666F6F223A20312C2022626172223A20327D'
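The reverse direction, which is roughly what the json converter has to do on import, can be sketched with the standard library alone. This is an illustration, not the code in `jsonconverter.py`; the `DECODERS` table and the function name `decode_json_annotation` are hypothetical, and it assumes the `encoding` value maps directly onto the matching `base64` decoder.

```python
import base64
import json
from typing import Any

# Assumption: the config's encoding name selects a stdlib decoder one-to-one.
DECODERS = {
    "base16": base64.b16decode,
    "base32": base64.b32decode,
    "base64": base64.b64decode,
}


def decode_json_annotation(value: str, encoding: str) -> Any:
    """Decode an encoded annotation value and parse the result as JSON."""
    raw_bytes = DECODERS[encoding](value)
    return json.loads(raw_bytes.decode())


decoded = decode_json_annotation(
    "7B22666F6F223A20312C2022626172223A20327D", "base16"
)
print(decoded)
```

Encoding the JSON keeps characters such as `;`, `=` and `,` out of the INFO field, where they would otherwise clash with the VCF syntax.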

mapping (`mappingconverter.py`)

Reads a list of key/value pairs, where items are separated by one character (e.g. `,`) and each key is separated from its value by another (e.g. `:`).

Example config:

- name: mapping
  converter_config:
      elements:
          - source: DABLA
            target: PATH.TO.TARGET
            item_separator: ',' # Default value
            keyvalue_separator: ':' # Default value
            value_target_type: int

Example annotation value: DABLA=foo:1,bar:2
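As a rough sketch of the parsing this config describes (not the code in `mappingconverter.py`), the value can be split twice, first on the item separator and then on the key/value separator. The function name `convert_mapping` is hypothetical, and passing the target type as a Python callable rather than a string is a simplification.

```python
from typing import Any, Callable, Dict


def convert_mapping(
    value: str,
    item_separator: str = ",",
    keyvalue_separator: str = ":",
    value_target_type: Callable[[str], Any] = int,
) -> Dict[str, Any]:
    """Parse a value such as 'foo:1,bar:2' into a dictionary."""
    result: Dict[str, Any] = {}
    for item in value.split(item_separator):
        # Split only on the first separator so values may contain it.
        key, raw_value = item.split(keyvalue_separator, 1)
        result[key] = value_target_type(raw_value)
    return result


mapped = convert_mapping("foo:1,bar:2")
print(mapped)
```

The resulting dictionary would then be stored under the configured target, here `PATH.TO.TARGET`.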

meta (`metaconverter.py`)

Uses meta information (the ##INFO header) to create JSON structures, where keys are fetched from the header and values from the annotation. The meta information line must match a given regex pattern, which is used to extract the keys.

Example config:

- name: meta
  converter_config:
      elements:
          - source: DABLA
            target: PATH.TO.TARGET
            meta_pattern: (?i)[a-z_]+\|[a-z_\|]+ # Default: Used to fetch keys
            element_separator: "|"
            subelements:
                - source: foo
                  target_type: int
                - source: bar
                  target_type: int

Example header line (meta information): ##INFO=<ID=DABLA,Number=.,Type=String,Description="Format: foo|bar">

Example annotation value: DABLA=1|2
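The interplay between header and value can be sketched as follows. This is a simplified illustration rather than the code in `metaconverter.py`: `convert_meta` and `TYPE_CASTS` are hypothetical names, the pattern is applied to the whole header line, and keys that are not listed under `subelements` are assumed to be dropped.

```python
import re
from typing import Any, Dict, List

# Assumption: only "int" appears in the example config; the other casts are illustrative.
TYPE_CASTS = {"int": int, "float": float, "str": str}
DEFAULT_META_PATTERN = r"(?i)[a-z_]+\|[a-z_\|]+"


def convert_meta(
    header_line: str,
    value: str,
    subelements: List[Dict[str, str]],
    meta_pattern: str = DEFAULT_META_PATTERN,
    element_separator: str = "|",
) -> Dict[str, Any]:
    """Pair keys from the ##INFO header with values from the annotation."""
    match = re.search(meta_pattern, header_line)
    if match is None:
        raise ValueError("meta information line does not match meta_pattern")
    keys = match.group(0).split(element_separator)
    raw = dict(zip(keys, value.split(element_separator)))
    result: Dict[str, Any] = {}
    for sub in subelements:
        raw_value = raw.get(sub["source"])
        if not raw_value:
            continue  # Assumption: missing or empty values are skipped.
        result[sub["source"]] = TYPE_CASTS[sub.get("target_type", "str")](raw_value)
    return result


header = '##INFO=<ID=DABLA,Number=.,Type=String,Description="Format: foo|bar">'
subelements = [
    {"source": "foo", "target_type": "int"},
    {"source": "bar", "target_type": "int"},
]
meta_result = convert_meta(header, "1|2", subelements)
print(meta_result)
```

With the default pattern, the first run of letters and underscores that is followed by `|` marks the start of the key list, which is why `foo|bar` is picked out of the Description text.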

Available specific annotation converters:

clinvarjson -> Converts current ClinVar data to the form expected by the database
clinvarreferences -> Reads data from the ClinVar JSON structure
hgmdextrarefs -> Reads data from HGMD__EXTRAREFS
vep -> Reads the CSQ field
hgmd -> Reads HGMD-specific fields

Ramblings

The generic annotation converters should handle most of the new annotation we could want.

However, adding new specific converters has also been made a lot simpler (see for example https://gitlab.com/alleles/ella/-/blob/95026d6453a948e6fe104753bbf12a5cef217c20/src/vardb/deposit/annotationconverters/hgmd.py). From here, there are three possible routes:

Add a plugin system for adding converters "on the fly".
Create more complex generic converters able to handle more complex annotation (albeit with more complex configuration)
Keep it "as is", and just add specific converters for any new "weird" annotation data.

Related issues

#140 (closed)

Notes to review (code/docs/QA)

Only the data in the INFO-column is currently supported. In the future we should also support data from the sample-specific columns, which should be added to a new column annotations in genotypesampledata.

In this MR I've attempted to limit logic changes to the import config and the annotation importer. All converters are simply moved from their previous location, with the exception of the (now redundant) frequency conversion, which is handled by keyvalue converters. Most tests have likewise only been moved. This means that code review is mainly needed on src/vardb/deposit/importers.py.

The table annotationconfig holds all configs, where the latest added row is the active one, meaning this will be used for all new annotation being imported.

In addition, the annotation table holds the id of the annotationconfig that was active at the time of import. This will help down the road, when we add an annotation view config.
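The "latest row is active" rule and the back-reference from annotation can be illustrated with a small stand-in. The sketch below uses an in-memory SQLite database purely for demonstration; the column names (`config`, `annotations`, `annotationconfig_id`) and the helper functions are assumptions and do not reflect the actual schema or migration in this MR.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE annotationconfig (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        config TEXT NOT NULL
    );
    CREATE TABLE annotation (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        annotations TEXT NOT NULL,
        annotationconfig_id INTEGER NOT NULL REFERENCES annotationconfig (id)
    );
    """
)


def add_config(config: list) -> int:
    """Insert a new config row; it becomes the active one."""
    cursor = conn.execute(
        "INSERT INTO annotationconfig (config) VALUES (?)", (json.dumps(config),)
    )
    return cursor.lastrowid


def active_config_id() -> int:
    """Return the id of the most recently inserted config."""
    row = conn.execute(
        "SELECT id FROM annotationconfig ORDER BY id DESC LIMIT 1"
    ).fetchone()
    return row[0]


def import_annotation(annotations: dict) -> int:
    """Store annotation together with the id of the config active right now."""
    cursor = conn.execute(
        "INSERT INTO annotation (annotations, annotationconfig_id) VALUES (?, ?)",
        (json.dumps(annotations), active_config_id()),
    )
    return cursor.lastrowid


first_config = add_config([{"name": "keyvalue"}])
first_annotation = import_annotation({"foo": 1})
second_config = add_config([{"name": "json"}])
second_annotation = import_annotation({"bar": 2})

rows = conn.execute(
    "SELECT id, annotationconfig_id FROM annotation ORDER BY id"
).fetchall()
print(rows)
```

Older annotation rows keep pointing at the config they were imported with, so a later view config could tell which import rules produced a given row.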

TODO:

Modify annotation JSON schema: Keep transcripts and references. Frequency? Other keys?
Add json schema for import config: available converters, config values etc
CLI command for updating annotation config
Typing

Suggested out of scope for this MR:

~~CLI command for testing annotation config~~
~~CLI command for creating config?~~

Tests

General

Tests have been added that prove my fix is effective or that my feature works
Related tests have been modified/removed

Hypothesis testing:

Soak testing has been done
Distribution between positive / negative cases has been checked

Database

Includes changes to database schema
Includes necessary database migrations

Configuration

Includes changes to configuration
Includes configuration migration instructions in documentation

Merge checklist

Self-review of code has been performed.
Feature review and validation against specification has been performed (if applicable). Apply label: QAdone
Need for documentation has been evaluated and, if necessary, updated. Apply label: docsdone
Code and implementation have been reviewed by another core developer (including any changes based on initial review). Apply label: code reviewdone

Closes #140 (closed) Closes #171 (closed)

Edited Jun 24, 2021 by Øyvind Evju
