inversionfix.py and biodiff should work with inversions now

inversionfix.py now outputs temp fasta with inversions undone, and a
temp vcf noting the inversions. I checked its output with 4-0010 and
the temp file has its sequence inverted compared to the copy in
$GROUPHOME/data/genomes where I expect it.

biodiff now calls dnadiff and inversionfix with appropriate arguments
for the isolate its running, and uses the temp fasta as the query.
However biodiff still cannot finish when running on 4-0010, nor can
the original biodiff.

Next I will have inversionfix.py change the transposons, hopefully
then biodiff will be able to run on 4-0010. If not, I will attempt
debugging.

Eventuallt I will write a script to take in udiff2vcf's standard
output and integrate it with the temp vcf from inversionfix.py,
and label the GAPs as insertions and deleations as appropriate.
parent 823f00db
......@@ -7,6 +7,7 @@ VERSION=0.2.2
reference="$1"
query="$2"
unset tmpdir
tmpdir=$(mktemp -p /tmp -d biodiff.XXXXXX)
refdir="$tmpdir/ref"
......@@ -51,14 +52,16 @@ then
fi
# here we can put the dnadiff call etc
#isolate=${query%.fasta} # strip .fasta from querry to get isolate id for -p value
#dnadiff -p $isolate $reference $query
#python inversionfix.py -r $isolate.rdiff -q $isolate.qdiff -v $isolate-temp.vcf -t $isolate-temp.fasta -f $query
isolatenpath=${query%.fasta} # get isolate ID by stripping suffix
isolate=${isolatenpath##*/} # and stripping path
dnadiff -p $isolate $reference $query
echo "dnadiff finished"
python /home/dconkleg/workspace/biodiff/inversionfix.py -r $isolate.rdiff -q $isolate.qdiff -v $isolate-temp.vcf -f $query -t $isolate-temp.fasta
# inversionfix.py will need to be made an exacutable in the path eventually
# call $isolate-temp.fasta instead of $query from now on below
read_fasta "$refdir" $reference
read_fasta "$qrydir" $query
read_fasta "$qrydir" $isolate-temp.fasta
# If we have one reference and one query, don't require matching identifiers
if [ $(ls "$refdir" | wc -l) = 1 ] && [ $(ls "$qrydir" | wc -l) = 1 ]
......
#!/usr/bin/env python
import os
import argparse
from Bio import SeqIO
GROUPHOME = os.environ['GROUPHOME']
def main():
#python ../inversionfix.py -r 4-0010.rdiff -v 4-0010-temp.vcf -q 4-0010.qdiff -f $GROUPHOME/data/genomes/4-0010.fasta -t 4-0010-temp.fasta
parser = argparse.ArgumentParser(description='finds inversions and adds them to vcf output of biodiff',
epilog='Runs nucmer and show-diff on fasta files to find inversions')
parser.add_argument('-r', '--rdiff', help='dnadiff output, reference coordinates', default='out.rdiff')
parser.add_argument('-q', '--qdiff', help='dnadiff output, query coordinates', default='out.qdiff')
parser.add_argument('-v', '--vcf', help='temp vcf file', default='out-temp.vcf')
parser.add_argument('-t', '--tfasta', help='temp fasta file', default='out-temp.fasta')
parser.add_argument('-f', '--qfasta', help='query fasta file')
parser.add_argument('-t', '--tfasta', help='temp fasta file', default='out-temp.fasta')
args = parser.parse_args()
print(args.tfasta)
chromname = 'snps_indels|NC_001416.1|'
# parse args.rdiff and make args.vcf with just the inversions, translocations, etc. in vcf format (no header)
# which biodiff will eventually combine with its own vcf from the other differences it finds
vcflines = []
with open(args.rdiff) as rfile:
for line in rfile:
cleanline = line.strip('\n')
rdiff = cleanline.split('\t')
#rdiff[2] is genome position, rdiff[1] is variation type (GAP, INV, etc), rdiff[4] is length (in reference)
if rdiff[1] == 'INV' or rdiff[1] == 'JMP' or rdiff[1] == 'BRK':
vcfstring = chromname + '\t' + rdiff[2] + '\t.\t.\t.\t.\tPASS\t' \
+ rdiff[1] + '|' + 'Length:' + rdiff[4] + '\n'
vcflines.append(vcfstring)
with open(args.vcf, 'w') as vfile:
for line in vcflines:
vfile.write(line)
# parse args.qdiff to get inversions in querry coordinates
# parse args.qfasta to get fasta of querry sequence,
# undo the inversions found in args.qdiff
# output noninverted fasta to args.tfasta, which biodiff will then use
inversions = [] # stores locations of inversions in querry sequence
translocations = []
breaks = []
with open(args.qdiff) as qfile:
for line in qfile:
cleanline = line.strip('\n')
qdiff = cleanline.split('\t')
if int(qdiff[4]) < 0: # negative inversions have earlier genome position as END position. fixing
vardict = dict(Start=int(qdiff[3]), End=int(qdiff[2]), Length=int(qdiff[4]))
else:
vardict = dict(Start=int(qdiff[2]), End=int(qdiff[3]), Length=int(qdiff[4]))
if qdiff[1] == 'INV' and abs(vardict['Length']) > 2:
inversions.append(vardict)
print(vardict)
if qdiff[1] == 'JMP':
translocations.append(vardict)
if qdiff[1] == 'BRK':
breaks.append(vardict)
qrecord = SeqIO.parse(args.qfasta, 'fasta').next() # read in the fasta file, should read only the first contig
qseq = qrecord.seq # get the sequence of the querry genome
for invdict in inversions:
before = qseq[0:invdict['Start']]
invseq = qseq[invdict['Start']:invdict['End']]
after = qseq[invdict['End']:]
fixedinv = invseq[::-1]
qseq = before + fixedinv + after
# next fix translocations
# write out temp fasta
qrecord.seq = qseq
SeqIO.write(qrecord, args.tfasta, 'fasta')
if __name__ == '__main__':
main()
......
Markdown is supported
0% or
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment