Potential variant filtering issue results in misclassification of B.1.1.7 positive sample
Hi,
I am using ncov_minipipe (version 3.0.1-2 release) and was struggling with a B.1.1.7 positive sample (confirmed by PCR, COG_UK pipeline and also manual inspection of the BAM file) that was not detected as being B.1.1.7 by ncov_minipipe (instead B.1.1.161 was assigned by Pangolin to the generated consensus).
A closer look at the single steps of the pipeline revealed that the relevant mutations/variants of the VOC were filtered out in the variant filtering step (transition from sample.vcf -->sample.filtered.vcf) and consequently are not included in the IUPAC consensus which is finally used by pangolin for lineage assignment. The following figure shows which of the B.1.1.7 defining mutations are present in which of the (intermediate) VCFs generated in the pipeline. Here, red means that the variant (x-axis) is present in the VCF (y-axis) and grey means that it is not present. In addition depth/coverage (RD), reference (RF) and alternate allele frequencies (AF) are given as percentage values. Figure 1:VOCmutations_B.1.1.7.vcf.pdf
Interestingly all the relevant mutations defining B.1.1.7 are present in the unfiltered VCF that directly comes from the freeBayes call and are filtered out in the hard filtering step (-->sample.filtered.vcf). The underlying call for generating the filtered VCF can be found in snakemake file (create_consensus.smk):
bcftools filter -e "INFO/MQM < {params.mqm} | INFO/SAP > {params.sap} | QUAL < {params.qual}" -o "{output}" -O z "{input}"
with params.mqm=40, params.sap=60 and params.qual=10 if using the defaults (which I did) according to the documentation. Some more digging into the data revealed that in my case the SAP (strand balance probability for the alternate allele) are in many cases much higher than the excluding upper bound threshold of 60. The relevant values used in this filtering step are given for my data in the following Figure. Figure 2: VOCmutationsFilterCriteria_B.1.1.7.vcf.pdf
As you can see many of the B.1.1.7 defining variants/mutations are filtered out because their SAP value is higher than 60. Taking into account that the data looks quite fine to me and all the mutations are clearly visible in the BAM I googled a bit more about 'strand balance probability' and came to the impression that for this value holds 'the higher the better'. See for example here
Is it possible the the filtering here is wrong? Shouldn't it be INFO/SAP < {params.sap}
("<" instead of ">")? Or at least the default for params.sap (60) is too aggressive.
Btw, ncov_minipe call was:
ncov_minipipe --input /XXXXX/ --reference /XXXXXX/NC_045512.2.fasta -o /XXXXout/ --pangolin pangolin --adapter QIAseq-Adapter.fasta --primer QIAseq_SARS-CoV-2_Primer_Panel_2_2.bedpe --annotation_gff /XXXXXX/NC_045512.2.gff --cpu 8 --taxid 2697049 --kraken /XXXXX/Kraken2DB/GRCh38.p13_SC2_2021-02-08
and I know that I can manually set --var_filter_sap
. However, the default behavior is here a bit too aggressive.
Best regards, Michael