Add Pangolin to pipeline
Usage
Pangolin takes a FASTA file as input, which may contain multiple sequences. In the example output below, the FASTA header is written to the `taxon` column.
No configuration or specific parameters are needed; we use the defaults:

```sh
pangolin path/to/fasta
```
With every update of Pangolin, all COVID-19 sequences should be rerun so they are correctly classified according to the latest lineage definitions.
In the database we should include the output columns `lineage` and `probability`. The lineage is the named variant of the virus, and the probability is the statistical confidence of the assigned lineage. `pangoLEARN_version` is not needed, as it is available by running `pangolin -pv`.
Example output (`lineage_report.csv`):

```csv
taxon,lineage,probability,pangoLEARN_version,status,note
MN994468.1,B,1,2021-01-16,passed_qc,
MN997409.1,A,1,2021-01-16,passed_qc,
MW430967.1,B.1.1.7,1,2021-01-16,passed_qc,16/17 B.1.1.7 SNPs
MN988668.1,None,0,2021-01-16,fail,N_content:0.97
```
Note: `status` should not be `fail`.
Docker image
https://hub.docker.com/r/staphb/pangolin/tags?page=1&ordering=last_updated
Pangolin crashes when a header longer than 252 characters is found in the input file, so a solution is to trim the headers to contain only the ID. Example:
```sh
# sudo cut -d ' ' -f1 sars_covid19_2.2_assembly.fa | sudo tee cut.fasta > /dev/null
cut -d ' ' -f1 sars_covid19_2.2_assembly.fa > cut.fasta
docker run --rm -v /srv/mar/SarsCovid19DB/versioned_genomes/2.2:/in:Z -v /srv/mar/SarsCovid19DB/versioned_genomes/2.2/pangolin:/out:Z staphb/pangolin:2.1.7 pangolin -t 1 -o /out /in/cut.fasta
```
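If we ever trim headers inside the pipeline rather than shelling out to `cut`, the same transformation is a few lines of stdlib Python. This is a hypothetical alternative sketch, not existing code; it keeps only the first whitespace-separated token of each header line, exactly like `cut -d ' ' -f1`:

```python
def trim_fasta_headers(lines):
    """Yield FASTA lines with headers reduced to the ID (first token)."""
    for line in lines:
        if line.startswith(">"):
            # keep only the ID, drop the description that makes headers long
            yield line.split()[0] + "\n"
        else:
            yield line

# Usage (accession and description are illustrative):
src = [">MN994468.1 Severe acute respiratory syndrome coronavirus 2\n",
       "ACGTACGT\n"]
print(list(trim_fasta_headers(src)))
```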
Time
Input | 1 thread | 4 threads
---|---|---
v2.2 | 46 min | 41 min
random fasta | 45 s | 44 s
Pangolin takes MUCH longer if run record by record => we need to run it on the concatenated FASTA file.
TODO
Need to redo the release sub-pipeline:
- Concatenate sequence files
- Run global tools (such as pangolin)
- Update records (includes touch)
- Start global services (such as sequence_server)
Todo:
- Add two attributes: `pangolin:lineage` and `pangolin:proba`
- Implement a command to get all the IDs present in the to-be-released revision
- Add it to the pipeline
- Concatenate the sequence files using this list of IDs
- Run pangolin. The lineages need to be mapped to https://cov-lineages.org/lineages/lineage_[lineage].html (e.g. https://cov-lineages.org/lineages/lineage_B.1.1.7.html)
- Implement a command to update records. The input is a JSON file containing an array of updates, which just needs to be fed as-is into the MongoDB driver. We do not consider this as updating the record, since the `valid` field is not changed. The `update_date` may be updated, however.
- Insert it in the pipeline.
- Make sure pangolin results are inserted with < ECO:0000363 or ECO:0005650 > `analysisUrl`. `analysisUrl` is described in #109 (closed)