Actual numbers of strains carrying significant unitigs are lower than those reported in the textual/visualisations output
Hi Laurent,
As the title tells, I found actual number of fasta files that contain significant unitigs are lower than that reported in the textual/visualisations output.
As an example, I ran DBGWAS on 282 Pseudomonas aeruginosa genomes described in the DBGWAS in a nutshell section.
The analysis reported a unitig with a q value of 9.730450e-09 as the most significant.
The textual/visualisations output tells that this unitig is present in 2/233 with Pheno 0 and 15/47 with Pheno 1.
The sequence of this unitig is "CCTTTGCCCAGTTGTGATGCATTCGCCAGTAACTGGTCTATTCCGCGTACTCCTGGATCGG", and I examined the presence of this unitig in the actual fasta files using the following commands.
cd pseudomonas_aeruginosa_full_dataset #pseudomonas_aeruginosa_full_dataset directory contains 282 fasta files
mkdir modified_fasta; cd $_
for f in ../*.fna; do name=$(basename ${f} .fna); cat ${f} | awk '!/^>/ { printf "%s", $0; n = "\n" } /^>/ { print n $0; n = "" } END { printf "%s", n }' > ${name}_modified.fna; done #remove line breaks from fasta files
for f in *.fna; do number_contig=$(cat ${f} | grep "CCTTTGCCCAGTTGTGATGCATTCGCCAGTAACTGGTCTATTCCGCGTACTCCTGGATCGG" | wc -l ); if [ ${number_contig} -ge 1 ]; then echo ${f} ; fi; done #list file names that carry this unitig
The above commands reported only 6 strains (WH-SGI-V-07165, WH-SGI-V-07425, WH-SGI-V-076137, WH-SGI-V-07687, WH-SGI-V-07323, WH-SGI-V-07329) carry this unitig. I might have misunderstood something in the analysis, but any help with this would be much appreciated.
Thank you!