Commit 4734ec40 authored by bue

@ annot : smoke test evolution

parent 79bba650
+141 −102
# Discussion

## Why Annot?

For a discussion of why annot was developed, please read our [publication](https://www.youtube.com/watch?v=dQw4w9WgXcQ).

## About not_yet_specified and not_available
A caveat we found in metadata annotated in simple Excel or Google Docs spreadsheets is empty fields.
An empty field leaves it completely open whether the information was not yet specified, and the annotator intended to go back and annotate the field later, or whether no such information is available at all.
Consequently, Annot initially sets every empty field to not_yet_specified.
A sample or reagent that has a not_yet_specified field in the primary key block will in general not be bridgeable.
However, if primary key fields are marked as not_available, for example for reagents like DMSO where we do not care about the exact provider, catalog number or lot number, the reagent will be bridgeable.

## About Protein Complexes, Proteins, Isoforms and such
Protein ids are not always as simple as in the example above, because some proteins, other than the bovine insulin above, have known isoforms. For this case we made use of the third of our hierarchical separation characters, the pipe symbol (|). This resulted, for example, in protein isoform identifiers like INS_P01308|1 (the canonical human insulin isoform).

Note: At UniProt, isoform identifiers are officially separated from the protein identifier by a dash
(e.g. P01308-1), which was not a possible character choice for An! identifiers.
In practice, if no exact isoform can be identified from the detailed protein reagent description, we agreed to always choose the UniProt defined canonical isoform. In addition we added an isoform_explicite boolean field to the protein brick, where it can be explicitly specified whether the reagent is a true isoform, or most probably a mixture of isoforms for which simply the canonical form was given.

If you are a bioinformatician, you will most probably be familiar with the task of integrating protein related data annotated with HUGO Gene Nomenclature Committee (HGNC) identifiers.
The first thing you will do is go to BioMart at Ensembl to transform the HUGO ids into proper, computer readable identifiers.
Thereby you will for sure lose some of the identifiers. One reason for this loss might be that the HGNC term used has become obsolete. But we found the more common case is that so called HGNC terms like COL1, ITGA2B1 or Laminin3B32 are not terms for single proteins but for proteinsets. A proteinset is a complex that is built out of two or more different proteins. As such there will be no single HGNC or UniProt id, but most probably a proper Gene Ontology (GO) term and identifier will be available in the cellular_component branch (COL1_go0005584, ITGA2B1_go0034666, Laminin3B32_go0061801).
This is the reason why in An! proteins and proteinsets are handled in two separate but relationally linked bricks.

All in all, the trained eye can immediately spot for each protein related An! identifier under discussion whether it is a single protein (INS_P01317) or in fact a proteinset (COL1_go0005584), and, when it is a protein, whether the protein has known isoforms (INS_P01308|1 versus INS_P01317). Not immediately, but by consulting the related brick, it can even be said from which species the protein is (INS_P01317 is from Cow_9913, INS_P01308|1 is from Human_9606) and, in the case of an isoform, whether it is this explicit isoform or just the canonical isoform as mentioned.


## About Controlled Vocabulary
note: the last underscore
note: the pipe symbol

Please check out *HowTo handle controlled vocabulary*

Annot enforces controlled vocabulary. This means all terms used for
annotating samples or reagents can be tracked back to controlled vocabulary
from a specific ontology.

Unfortunately, the human readable terms familiar to wetlab scientists are
not always the terms used by the ontology. Additionally, we learned the hard way
that ontology terms can change over time or even get deprecated. What more
likely (though not always) stands the test of time, and is often more
precise than the ontology term, is the ontology based identifier.

So, annot's solution to this problem is to generate for each term an annot id,
which links ontology term and id together by an _ underscore. For example
bovine insulin from UniProt transforms into the annot id [INS_P01317](http://www.uniprot.org/uniprot/P01317).
Annot ids are further restricted to alphanumeric characters and the
underscore. The official ontology identifier is always found behind the
last underscore. This is handy: in python3 `annot_term_id.split('_')[-1]` will
always return the official identifier. The term part of the annot
identifier can, if needed, be adjusted to a term everyone in the lab is
familiar with. The ontology identifier, on the other hand, should stay
untouched. For example, the official term hyaluronic_acid_chebi16336 turned
in our lab into HA_chebi16336.
A caveat of annot's solution, however, is that the adjustment of the
annot id has to happen before the term is used for sample or reagent
annotation.
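
The last-underscore rule can be sketched in a few lines of python3. This is only an illustration and not code from annot itself. The helper name `split_annot_id` is made up for this example.

```python
def split_annot_id(annot_term_id):
    """Split an annot id at the LAST underscore into (term, ontology identifier).

    Hypothetical helper for illustration; not part of the annot code base.
    """
    term, _, ontology_id = annot_term_id.rpartition("_")
    return term, ontology_id


# The term part may itself contain underscores; the identifier never does.
print(split_annot_id("INS_P01317"))                  # ('INS', 'P01317')
print(split_annot_id("hyaluronic_acid_chebi16336"))  # ('hyaluronic_acid', 'chebi16336')
print(split_annot_id("HA_chebi16336"))               # ('HA', 'chebi16336')
```

Because only the text behind the last underscore is treated as the identifier, renaming hyaluronic_acid to HA does not change what the id resolves to.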

Should it ever happen that an id from an ontology gets deprecated, then the
particular term will not be deleted. Instead, in `appsabbrick` (red colored in the GUI),
in the `EndpointBricked`, `PerturbationBricked` or `SampleBricked` table, the
`ok_brick` field will be set to False (which appears as a white x in a red dot in the GUI),
and the `ontology_term_status` field in the corresponding ontology (orange colored in the GUI)
will be set to False.

In case needed terms are missing from a particular ontology, terms can be
added by clicking the `Add` button in the particular ontology (orange colored in the GUI).
Consider contacting the related ontology and requesting that the missing term be added.

In annot, the origin version of the controlled vocabulary contains just the latest
information pulled from the original source. Only the backup version will store
adapted annot ids, deprecated ontology identifiers and own added ontology terms.
Origin version and backup files can be found inside annot at `/usr/src/media/vocabulary/`.

Whenever possible we took controlled vocabulary from existing, well established
ontologies. However, there were terms we could not find in an appropriate ontology,
for example all terms from the apponprovider_own vocabulary. These terms we
generated ourselves. All of these terms will have "Own" as term id, so they are
easily detectable. For example Boots_Own for the Boots pharmacy.

Further readings:
+ HowTo handle controlled vocabulary?
+ HowTo deal with huge ontologies?
+ HowTo get detailed information about the ontologies in use?
+ HowTo backup annot content?


## About Proteins, Protein Isoforms and Protein Complexes

Now protein ids are a bit of a special case, because some proteins,
other than the bovine insulin used as example above, have known isoforms.
For such cases we introduced an additional hierarchical separation character
beside the underscore: the | pipe symbol. This resulted, for example, in protein
isoform identifiers like INS|1_P01308|1, the canonical human insulin isoform.
Please note that in UniProt the isoform identifier is officially separated
from the protein identifier by a dash (e.g. P01308-1). But the - dash
was an impossible character choice for annot, as it was already taken by the
annot brick identifiers to separate the primary key parts from each other.
For example, the brick id INS_P01317-BootsOwn_F49US_76400575 is the id of
cow insulin provided by Boots pharmacy, catalog number F49US, lot number 76400575.
The nice thing about having such hierarchical separator characters is that, for
example in python3, `annot_brick_id.split('-')[0]` will always return the
annot term id, which can then be taken further apart.
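
As a minimal sketch of taking a brick id apart along the separator hierarchy, consider the following. The function name `parse_brick_id` and the returned dictionary keys are invented for this example and are not annot's API. The sketch also assumes, based only on the example id above, that the parts after the first dash are joined by underscores.

```python
def parse_brick_id(annot_brick_id):
    """Take a brick id apart: dash first, then last underscore, then pipe.

    Hypothetical helper for illustration; not part of the annot code base.
    """
    # The first dash separates the annot term id from the remaining key parts.
    term_id, _, key_block = annot_brick_id.partition("-")
    # The last underscore separates the term from the ontology identifier.
    term, _, ontology_id = term_id.rpartition("_")
    return {
        "term_id": term_id,
        "term": term,
        "ontology_id": ontology_id,
        # For proteins: turn annot's pipe back into UniProt's dash notation.
        "uniprot_style_id": ontology_id.replace("|", "-"),
        # Assumption: provider, catalog number and lot number are underscore separated.
        "key_parts": key_block.split("_") if key_block else [],
    }


d_brick = parse_brick_id("INS_P01317-BootsOwn_F49US_76400575")
print(d_brick["term_id"])    # INS_P01317
print(d_brick["key_parts"])  # ['BootsOwn', 'F49US', '76400575']
```

The `key_parts` split is the fragile piece: it would break for a catalog or lot number that itself contains an underscore.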

If no exact isoform could be identified in the protein reagent description,
we agreed in our lab to always choose the UniProt defined canonical isoform.
In addition we added an isoform_explicit boolean field to the protein brick,
where it can be explicitly specified whether the reagent is a true isoform or most
probably a mixture of isoforms, for which simply the canonical form was given
because the explicit isoform was unknown.

UniProt only covers up to the protein level. However, several of the
reagents we perturbed our samples with were not single proteins but protein
complexes, for example COL1, ITGA2B1 or Laminin3B32. For protein complexes
we used the Gene Ontology cellular component identifiers, which resulted in
annot ids like COL1_go0005584, ITGA2B1_go0034666, Laminin3B32_go0061801.

So, in annot it is super easy to spot whether you deal with a protein or a protein
complex, or whether the protein you deal with has known isoforms. It needs a bit
more digging to figure out which species the protein comes from, for
example Human_9606 or Cow_9913. But this is annotated in the bricks in the
`Protein DNA code source` field.
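
What the eye spots can also be written down as a rough python3 sketch. The function `describe_protein_id` is a made-up name and not part of annot. It assumes the id passed in is a protein related annot id, that Gene Ontology identifiers start with a lowercase `go`, and that isoforms are marked with the pipe symbol, as in the examples above.

```python
def describe_protein_id(annot_term_id):
    """Classify a protein related annot id by looking only at its identifier part.

    Hypothetical helper for illustration; not part of the annot code base.
    """
    # The official ontology identifier sits behind the last underscore.
    ontology_id = annot_term_id.rpartition("_")[2]
    if ontology_id.startswith("go"):
        return "protein complex"
    if "|" in ontology_id:
        return "protein with known isoforms"
    return "protein"


for s_id in ("INS_P01317", "INS|1_P01308|1", "COL1_go0005584"):
    print(s_id, "->", describe_protein_id(s_id))
```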

Further reads:
+ HowTo annotate protein complexes?


## About Sample and Reagent Identifiers
note: about the -_| hierarchy
note: the dash

We aimed for sample and reagent identifiers that are traceable down to the lot number level, if needed.
So we built our complete identifiers hierarchically out of controlled vocabulary identifier blocks (such as INS_P01317 mentioned above), by consistently connecting the controlled vocabulary blocks with dashes (-).
This resulted in identifiers like: INS_P01317-Boots_Own-76400575.

Unfortunately this concept resulted in rather long, but we think still manageable, identifiers.

So, in Python3, if you simply need the sample or reagent name to label your plot, `annot_id.split("-")[0]` will give you that.
If you are interested in the UniProt id of your protein, a simple `annot_id.split("-")[0].split("_")[-1].replace("|", "-")` will provide you this.
And if you are instead interested solely in the protein name without the UniProt identifier (keep in mind that the name itself can have none to many underscores), `"_".join(annot_id.split("-")[0].split("_")[:-1])` will provide you this.

(Frankly, we are not aware of another such systematic yet human and computer readable identifier nomenclature.
Above all, we think the advantages of a systematically and hierarchically built identifier will pay off in future data integration.)

We here mainly talked about proteins. Similarly we generated systematic structured primary keys, in our implementation always called annot_id, for primary antibodies (antigen, host organism, provider, catalog number, lot number), secondary antibodies (host organism, target organism, dye, provider, catalog number, lot number), compounds (compound, provider, catalog number, lot number), compound stains (compound, dye, provider, catalog number, lot number), proteinsets (proteinset, provider, catalog number, lot number) and samples (sample name, provider, catalog number, lot number).
For details please have a look at the source code.

A sample or reagent that has a not_yet_specified field in the primary key block will in general not be bridgeable. However, if primary key fields are marked as not_available, for example for reagents like DMSO where we do not care about the exact provider, catalog number or lot number, the reagent will be bridgeable.
So, it is entirely up to the user how deep down they like to track their sample and reagent annotation. Name and identifier might in some cases be accurate enough, whereas in other cases knowledge of the exact lot number is required. not_available is always a valid choice.
With annot it is even possible to track samples down to the passage number level and reagents down to the aliquot level.
However, this is not done at the brick level, but at the bridging level, inside the layout files. It is possible, but it might cause some extra work, especially when the same layout is used often.
Now, as good as it may sound, does it really make sense to track samples and reagents down to the deepest level possible?
Our experience in basic research is that it does not really make sense to track samples or reagents below the lot number level. Sometimes it may not even make sense to track the data down to the lot number level.
It is a burden you put on your wet lab staff. And in basic research the experiment is repeated anyway to confirm findings.
It looks a bit different, however, for data that is produced for a project like LINCS: data that will hopefully be studied and integrated a lot after it is generated and normalized. There it really makes sense to track the reagent lot number and the cell passage number.
Choose your tracking level wisely. Choose what makes sense.


## About not_yet_specified and not_available

A caveat we found in metadata annotated in simple spreadsheets is empty fields.
An empty field leaves it completely open whether the information was not yet specified,
and the annotator intended to go back and annotate the field later, or whether there is
no such information available at all. Consequently, Annot initially sets every empty field
to not_yet_specified.

A sample or reagent that has a not_yet_specified field in the primary key
block will in general not be brickable. If, however, primary key fields are
marked as not_available, for example for reagents like DMSO where we do not care about the
exact provider, catalog number or lot number (DMSO_chebi28262-notavailable_notavailable_notavailable),
the reagent will be brickable.
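
The rule can be illustrated with a small python3 sketch. It is not annot's implementation. The function names and the dictionary field names (`compound`, `provider`, `catalog_id`, `batch_id`) are assumptions chosen for this example, and the sketch ignores the "in general" exceptions hinted at above.

```python
NOT_YET_SPECIFIED = "not_yet_specified"
NOT_AVAILABLE = "not_available"


def fill_empty_fields(d_field):
    """Replace every empty field by not_yet_specified (hypothetical helper)."""
    return {
        s_key: s_value.strip() or NOT_YET_SPECIFIED
        for s_key, s_value in d_field.items()
    }


def is_brickable(d_primary_key):
    """A record is brickable only if no primary key field is not_yet_specified."""
    return all(s_value != NOT_YET_SPECIFIED for s_value in d_primary_key.values())


# DMSO where provider, catalog id and batch id are deliberately not tracked.
d_dmso = {
    "compound": "DMSO_chebi28262",
    "provider": NOT_AVAILABLE,
    "catalog_id": NOT_AVAILABLE,
    "batch_id": NOT_AVAILABLE,
}
print(is_brickable(d_dmso))  # True

# The same reagent straight from a spreadsheet with fields left empty.
d_raw = fill_empty_fields(
    {"compound": "DMSO_chebi28262", "provider": "", "catalog_id": "", "batch_id": ""}
)
print(is_brickable(d_raw))  # False
```

The point of the two distinct values is that not_available is a deliberate decision by the annotator, while not_yet_specified marks work that is still open.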


## HowTo annotate protein complexes?
# bue 20180407: this belongs to the howto section. but Dave has at the moment the howto section for correction.
In the GUI:
1. scroll to the green colored `Appbrreagentprotein`  section.
1. click `Perturbation Proteinset`.
    1. under `Protein set` choose the gene ontology cellular component identifier
      for the protein complex you want to annotate. E.g. COL1_go0005584.
    1. choose the `Provider`.
    1. enter `catalog id`.
    1. enter `batch id`.
    1. adjust `Availability`, `Final concentration unit` and `Time unit` if necessary.
    1. click `Save`.

    Now Collagen 1 is a protein complex built out of
    two COL1A1_P02453 Collagen alpha-1 (I) chain proteins
    and one COL1A2_P02465 Collagen alpha-2 (I) chain protein.
    Both of these proteins have to be annotated.

1. click `Perturbation Protein` and enter both proteins as usual.
   + Most important: under `Protein set` choose the proteinset generated before.
   + Not so important, but you still should do it: enter the `Proteinset ratio`
     2:1 because it is known.
   + Our lab convention is: Set `Availability` to False, because the
     single protein as such is not available.
   + Our lab convention is: Give the `Stock solution concentration` for the
     whole protein complex, do not divide by the protein ratio, because there are
     protein complex reagents where the exact ratio is unknown.

1. now you should be able to brick this COL1_go0005584 protein set.


## Programmer Contribution

+ Elmar Bucher: main programmer.
+ Cheryl Claunch: co-programmer to bring version 4 alive.
+ Derrick He: cron job backup routine implementation.


## Contact Information

Send questions and love letters to: buchere at ohsu dot edu.
Pizza napoli or margherita, delivered, goes to: Elmar Bucher, 3N04605.
Money donations go to: [Médecins Sans Frontières](https://www.doctorswithoutborders.org/).
Contact bue at https://gitlab.com/biotransistor/annot
or send an email to buchere at ohsu dot edu.
+3 −2
@@ -308,6 +308,7 @@ about the subject. This section just deals with the available annot commands.
  content with it. This command will break if a online ontology fails to be
  downloadable.

# bue 20180402: this has changed, but now dave is working on the documentation
+ `nix/cron_vocabulary.sh` is a shell script, written to back up each and every
  vocabulary one by one. Before annot backs up any vocabulary, it first updates
  the vocabulary to the latest ontology version available. This script will not
+0 −1
@@ -19,7 +19,6 @@ class Study(models.Model):
    study_name = models.SlugField(
        max_length=256,
        default="not_yet_specified",
        verbose_name="Study title",
        help_text="A concise phrase used to encapsulate the purpose and goal of the study."
    )
    date_submission = models.DateField(
+1 −1
@@ -626,7 +626,7 @@ def objjson2lincs(s_filename, dd_json, ls_column, b_verbose=True):
                d_out.update({"PR_Comments": "single protein"})  # protein

            elif (s_column == "PR_UniProt_ID") and (d_record["proteinset_annot_id"] == "not_available-notavailable_notavailable_notavailable"):
                d_out.update({"PR_UniProt_ID": d_record["protein"].split("_")[1].replace("|","-")})  # split off the UniProt id; annot's "|" isoform separator becomes UniProt's "-"

            elif (s_column == "PR_Protein_Complex_Known_Component_UniProt_IDs") and (d_record["proteinset_annot_id"] == "not_available-notavailable_notavailable_notavailable"):
                d_out.update({"PR_Protein_Complex_Known_Component_UniProt_IDs": "not_available"})  # nop