Add Pangolin to pipeline
Usage
Pangolin takes a FASTA file as input, which may contain multiple sequences. In the example output below, the FASTA header is written to the `taxon` column.
No configuration or specific parameters are needed; we use the defaults:

```sh
pangolin path/to/fasta
```
With every update of Pangolin, all COVID-19 sequences should be rerun so they are correctly classified according to the latest lineage definitions.
In the database we should include the output columns `lineage` and `probability`. The lineage is the named variant of the virus, and the probability is the statistical confidence of the assigned lineage. `pangoLEARN_version` is not needed, as it is available by running `pangolin -pv`.
Example output (`lineage_report.csv`):

```csv
taxon,lineage,probability,pangoLEARN_version,status,note
MN994468.1,B,1,2021-01-16,passed_qc,
MN997409.1,A,1,2021-01-16,passed_qc,
MW430967.1,B.1.1.7,1,2021-01-16,passed_qc,16/17 B.1.1.7 SNPs
MN988668.1,None,0,2021-01-16,fail,N_content:0.97
```
Note: `status` should not be `fail`.
Docker image
https://hub.docker.com/r/staphb/pangolin/tags?page=1&ordering=last_updated
Pangolin crashes when a header longer than 252 characters is found in the input file, so a solution is to trim the headers to contain only the ID. Example:
```sh
# sudo cut -d ' ' -f1 sars_covid19_2.2_assembly.fa | sudo tee cut.fasta > /dev/null
cut -d ' ' -f1 sars_covid19_2.2_assembly.fa > cut.fasta
docker run --rm -v /srv/mar/SarsCovid19DB/versioned_genomes/2.2:/in:Z -v /srv/mar/SarsCovid19DB/versioned_genomes/2.2/pangolin:/out:Z staphb/pangolin:2.1.7 pangolin -t 1 -o /out /in/cut.fasta
```
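If we ever trim headers inside the pipeline rather than shelling out to `cut`, the same transformation is a few lines of stdlib Python. This is a hypothetical alternative sketch, not existing code; it keeps only the first whitespace-separated token of each header line, exactly like `cut -d ' ' -f1`:

```python
def trim_fasta_headers(lines):
    """Yield FASTA lines with headers reduced to the ID (first token)."""
    for line in lines:
        if line.startswith(">"):
            # keep only the ID, drop the description that makes headers long
            yield line.split()[0] + "\n"
        else:
            yield line

# Usage (accession and description are illustrative):
src = [">MN994468.1 Severe acute respiratory syndrome coronavirus 2\n",
       "ACGTACGT\n"]
print(list(trim_fasta_headers(src)))
```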
Time
Input | 1 thread | 4 threads
---|---|---
v2.2 | 46 min | 41 min
random fasta | 45 s | 44 s
Pangolin takes MUCH longer if run record by record => we need to run it on the concatenated FASTA file.
TODO
Need to redo the release sub-pipeline:
- Concatenate sequence files
- Run global tools (such as pangolin)
- Update records (includes touch)
- Start global services (such as sequence_server)
Todo:
- Add two attributes: `pangolin:lineage` and `pangolin:proba`
- Implement a command to get all the IDs present in the to-be-released revision
- Add it to the pipeline
- Concatenate the sequence files using this list of IDs
- Run pangolin. The lineages need to be mapped to https://cov-lineages.org/lineages/lineage_[lineage].html (e.g. https://cov-lineages.org/lineages/lineage_B.1.1.7.html)
- Implement a command to update records. The input is a JSON file containing an array of updates, which just needs to be fed as-is into the MongoDB driver. We do not consider this as updating the record, since the `valid` field is not changed. The `update_date` may be updated, however.
- Insert it in the pipeline.
- Make sure pangolin results are inserted with < ECO:0000363 or ECO:0005650 > `analysisUrl`. `analysisUrl` is described in #109 (closed)