inversionfix.py now works with inversions, but not nested inversions.

inversionfix.py interprets dnadiff output more accurately, and undoes
inversions using reverse complementation. When the integrated biodiff
ran on an test querry and reference with 2 known inversions,
inversionfix.py correctly identified the inversions, and created a
temp file with the inversion fixed it then passed to the rest of biodiff,
resulting in a cleaner output vcf from biodiff. There were a few bases
along the edges of the inversions that were left unaltered, resulting in
a couple entries from biodiff, but only a few. When run on isolate 1-0007
inversionfix.py finds and reverses some inversions.

However for isolate 4-0010, which has a large inversion with several
smaller inversions overlapping it, inversionfix.py fails to correct
them properly. It interprets the start of a sub inversion as the end
of the current inversion, resulting in the large inversion being
intrpretted as a series of consecutive inversions. Thus the syntenny
of the components is still flipped, even if their contents are now
the right way around. I do not yet know if dnadiff records any
information reguarding the nested structure of these inversions, or
if I will need to find another method to find and reverse nested
inversions. Perhaps someone has already written a paper...
parent 813a455a
......@@ -17,7 +17,7 @@ def main():
parser.add_argument('-t', '--tfasta', help='temp fasta file', default='out-temp.fasta')
args = parser.parse_args()
print(args.tfasta)
print(args)
chromname = 'snps_indels|NC_001416.1|'
# parse args.rdiff and make args.vcf with just the inversions, translocations, etc. in vcf format (no header)
# which biodiff will eventually combine with its own vcf from the other differences it finds
......@@ -43,33 +43,46 @@ def main():
translocations = []
breaks = []
with open(args.qdiff) as qfile:
line2inv = 0
for line in qfile:
cleanline = line.strip('\n')
qdiff = cleanline.split('\t')
if int(qdiff[4]) < 0: # negative inversions have earlier genome position as END position. fixing
vardict = dict(Start=int(qdiff[3]), End=int(qdiff[2]), Length=int(qdiff[4]))
else:
vardict = dict(Start=int(qdiff[2]), End=int(qdiff[3]), Length=int(qdiff[4]))
if qdiff[1] == 'INV' and abs(vardict['Length']) > 2:
if qdiff[1] == 'INV' and line2inv == 0: # INV ifo in 2 lines. first line has Start, 2nd has End
vardict = dict(Start=int(qdiff[2]))
inversions.append(vardict)
print(vardict)
if qdiff[1] == 'JMP':
translocations.append(vardict)
if qdiff[1] == 'BRK':
breaks.append(vardict)
line2inv = 1
elif qdiff[1] == 'INV':
inversions[-1]['End'] = int(qdiff[2])
line2inv = 0
# with open(args.qdiff) as qfile:
# for line in qfile:
# cleanline = line.strip('\n')
# qdiff = cleanline.split('\t')
# if int(qdiff[4]) < 0: # negative inversions have earlier genome position as END position. fixing
# vardict = dict(Start=int(qdiff[3]), End=int(qdiff[2]), Length=int(qdiff[4]))
# else:
# vardict = dict(Start=int(qdiff[2]), End=int(qdiff[3]), Length=int(qdiff[4]))
# if qdiff[1] == 'INV' and abs(vardict['Length']) > 2:
# inversions.append(vardict)
# print(vardict)
# if qdiff[1] == 'JMP':
# translocations.append(vardict)
# if qdiff[1] == 'BRK':
# breaks.append(vardict)
qrecord = SeqIO.parse(args.qfasta, 'fasta').next() # read in the fasta file, should read only the first contig
qseq = qrecord.seq # get the sequence of the querry genome
fixedrecord = qrecord
for invdict in inversions:
before = qseq[0:invdict['Start']]
invseq = qseq[invdict['Start']:invdict['End']]
fixedinv = qseq[invdict['Start']:invdict['End']].reverse_complement()
after = qseq[invdict['End']:]
fixedinv = invseq[::-1]
qseq = before + fixedinv + after
# next fix translocations
# write out temp fasta
qrecord.seq = qseq
SeqIO.write(qrecord, args.tfasta, 'fasta')
fixedrecord.seq = qseq
SeqIO.write(fixedrecord, args.tfasta, 'fasta')
if __name__ == '__main__':
main()
......
Markdown is supported
0% or
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment