A parallelized tool to align large sequence files against a local database. This project was originally designed to align mitochondrial reads from total genomic libraries against a mitochondrial genome database.
Sequences are aligned against a local database using the program BlastN and a given e-value cut-off. The user may select all aligned reads (criteria "1") or only the reads for which both mates of a pair aligned to the local database (criteria "0", used by default). The selection criteria may be changed with the command argument "-c" or "--criteria" in extractAligned.
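The two selection criteria can be pictured as set operations on the ids of the aligned reads. The following is only a minimal sketch, not the code used by extractAligned: the function names `normalise_id` and `select_reads` are hypothetical, and it assumes that mates share an id apart from a trailing "/1" or "/2" suffix.

```python
def normalise_id(read_id):
    """Strip a trailing /1 or /2 mate suffix (assumed naming convention)."""
    if read_id.endswith("/1") or read_id.endswith("/2"):
        return read_id[:-2]
    return read_id


def select_reads(aligned_1, aligned_2, criteria=0):
    """Return (ids to keep from file 1, ids to keep from file 2).

    criteria 0: keep a pair only if both mates aligned.
    criteria 1: keep every read that aligned, regardless of its mate.
    """
    ids_1 = set(normalise_id(i) for i in aligned_1)
    ids_2 = set(normalise_id(i) for i in aligned_2)
    if criteria == 0:
        both = ids_1 & ids_2  # pairs in which both mates aligned
        return both, both
    if criteria == 1:
        return ids_1, ids_2
    raise ValueError("criteria must be 0 or 1")
```

With criteria "0" both files end up with the same set of pairs, which is why that mode needs two input files.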
The program align2db takes one sequence file in FASTA or FASTQ format. Files may be compressed with GZIP. The program extractAligned takes one or two sequence files in FASTA or FASTQ format. Files may be compressed with GZIP. Criteria "0" requires two input files. The program extractAligned also takes one or two text files with sequence ids and e-values (as printed by align2db into STDOUT).
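As an illustration of how plain and GZIP-compressed inputs can be handled the same way, here is a small sketch. The helper `open_sequence_file` is hypothetical (it is not part of align2db or extractAligned), and the text mode "rt" used for `gzip.open` assumes Python 3.

```python
import gzip


def open_sequence_file(path, gzipped=False):
    """Open a plain or GZIP-compressed sequence file in text mode.

    Assumption: Python 3, where gzip.open accepts the "rt" mode.
    """
    if gzipped:
        return gzip.open(path, "rt")
    return open(path, "r")
```

The returned handle can then be passed to Biopython, for example `SeqIO.parse(handle, "fastq")`, to iterate over the records.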
The program align2db will print all aligned reads to Python's standard output (STDOUT).
The program extractAligned will print all selected reads directly into sequence files. If the option "-p" or "--pruned" is used, un-selected sequences will also be saved into different files.
NOTE: If the verbose option is on, output verbosity will increase considerably (most likely resulting in additional time costs).
(1) Argument options for align2db:
-q, --query: query sequence
-d, --database: the name of the local Blast database
-f, --format: the input sequence format (FASTA or FASTQ; default = fastq)
-o, --output: the prefix for the XML files produced by BlastN (default = alignments)
-e, --evalue: the e-value cutoff (default = 1e-5)
-p, --processes: number of processes to run simultaneously (default = 4)
-z, --gzip: use for compressed GZIP files (default = off)
-k, --keep_xml: keep alignments in XML format (default = off)
-V, --version: print version number and quit (default = off)
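For readers who want to see how these options map onto argparse, a rough sketch follows. It mirrors the flags and defaults listed above, but it is not the actual align2db source: marking "-q" and "-d" as required, restricting "-f" to two choices, and the version string are all assumptions.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented align2db options (sketch)."""
    parser = argparse.ArgumentParser(prog="align2db")
    # Assumption: query and database are mandatory.
    parser.add_argument("-q", "--query", required=True,
                        help="query sequence file")
    parser.add_argument("-d", "--database", required=True,
                        help="name of the local Blast database")
    parser.add_argument("-f", "--format", default="fastq",
                        choices=["fasta", "fastq"],
                        help="input sequence format")
    parser.add_argument("-o", "--output", default="alignments",
                        help="prefix for the XML files produced by BlastN")
    parser.add_argument("-e", "--evalue", type=float, default=1e-5,
                        help="e-value cutoff")
    parser.add_argument("-p", "--processes", type=int, default=4,
                        help="number of processes to run simultaneously")
    parser.add_argument("-z", "--gzip", action="store_true",
                        help="input is GZIP-compressed")
    parser.add_argument("-k", "--keep_xml", action="store_true",
                        help="keep alignments in XML format")
    # Placeholder version string; the real one is not given here.
    parser.add_argument("-V", "--version", action="version",
                        version="align2db (version not specified)")
    return parser
```

Parsing `["-q", "pair1.fastq.gz", "-d", "mitodb", "-z"]` with this parser leaves every other option at the default shown in the list above ("mitodb" is a made-up database name).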
Requires Python 2.6.X or newer versions.
The following modules/libraries are required:
SeqIO from Bio
argparse, os and sys
Imagine an experiment that generates reads in FASTQ format from a total genomic library (i.e. nuclear DNA, mitochondrial DNA and maybe other sequences as well). Also imagine that the experiment results in paired-end reads that are saved in compressed GZIP files named pair1.fastq.gz and pair2.fastq.gz.
TIP: You can create local databases with the makeblastdb program from the NCBI BLAST+ suite.
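A possible way to build a nucleotide database from Python is sketched below. The flags `-in`, `-dbtype` and `-out` belong to makeblastdb, while the helper name `makeblastdb_command`, the file name `mito_genomes.fasta` and the database name `mitodb` are made up for this example.

```python
import subprocess


def makeblastdb_command(fasta_path, db_name):
    """Return the makeblastdb command line for a nucleotide database."""
    return ["makeblastdb", "-in", fasta_path, "-dbtype", "nucl",
            "-out", db_name]


# To actually build the database (requires BLAST+ on the PATH):
# subprocess.check_call(makeblastdb_command("mito_genomes.fasta", "mitodb"))
```

The resulting database name is what would be passed to align2db with "-d".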
Copy all the necessary files into the working directory.