The scripts used to process the supporting genomic and transcriptomic data are available in the appropriate subdirectories of the [processing repository](https://gitlab.com/PlantGenomicsLab/annotationtool/tree/master/process), grouped by processing step. [This page](https://gitlab.com/PlantGenomicsLab/annotationtool/wikis/Scripts-Descriptions) is an atlas of script functions: step categories are listed in processing order, and scripts are listed alphabetically by name within each category.
fetchSRA.sh | gather data | Gather data from multiple NCBI SRRs (**S**equence **R**ead Archive **R**un accessions) | Text file listing the SRR accessions that you want to download | raw reads in .fastq format | remove the `--split-files` option on line 22 if the gathered data are not paired-end
validateSRA.sh | gather data | Validate that the correct files have been downloaded from NCBI | Text file listing the SRR accessions that you want to download | output and error files confirming that all runs were downloaded correctly |
filtersubmit.sh | filter | run filterLen.py with the correct module and input settings to remove small scaffolds from genome assembly | genome assembly of interest | genome assembly excluding scaffolds < 500 bp |
repeatModeler.sh | mask | Use [RepeatModeler](http://www.repeatmasker.org/RepeatModeler/) to generate a de novo repeats library for a genome | filtered genome assembly in fasta format | repeats library suffixed consensi.fa.classified |
...
concat.sh | mask | Concatenate all of the pieces of the split and softmasked gen
repeatmasker.sh | mask | Use [RepeatMasker] to softmask the regions in the genome recognized as repetitive | a piece of the filtered genome, repeat library generated by [RepeatModeler](http://www.repeatmasker.org/RepeatModeler/) | softmasked piece of genome |
splitfasta.sh | mask | Reduce the running time of RepeatMasker by splitting the genome into pieces | filtered genome in fasta format | multiple pieces of filtered genome |
frameSelect.sh | TSA Prep | Uses [TransDecoder](https://github.com/TransDecoder/TransDecoder/wiki) to identify coding regions in the transcript assemblies and translate into peptide sequences | TSA fasta file | BED, GFF3, CDS (nt coding sequence) & peptide files representing recovered coding regions |
usearch.sh | TSA Prep | Uses [USearch v9.0](https://www.drive5.com/usearch/manual9/) to cluster multiple frame-selected TSAs (**T**ranscriptome **S**hotgun **A**ssemblies) by sequence homology into a consensus transcriptome | A single fasta made of concatenated frame-selected TSAs | Clustered reference transcriptome |
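
The filter and split steps in the table above can be illustrated with a short Python sketch. This is not the code in `filterLen.py` or `splitfasta.sh`; the function names (`read_fasta`, `filter_by_length`, `split_records`) are hypothetical, and the round-robin split is an assumption chosen for simplicity. Only the 500 bp cutoff comes from the table.

```python
from typing import Iterable, Iterator, List, Tuple

# A record is (header without the leading '>', full sequence).
Record = Tuple[str, str]


def read_fasta(lines: Iterable[str]) -> Iterator[Record]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header = None
    chunks: List[str] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_by_length(records: Iterable[Record], min_len: int = 500) -> List[Record]:
    """Keep scaffolds of at least min_len bp (drops scaffolds < 500 bp by default)."""
    return [(h, s) for h, s in records if len(s) >= min_len]


def split_records(records: Iterable[Record], n_pieces: int) -> List[List[Record]]:
    """Distribute records round-robin into up to n_pieces non-empty groups.

    Assumption: the real splitfasta.sh may divide the genome differently.
    """
    pieces: List[List[Record]] = [[] for _ in range(n_pieces)]
    for i, rec in enumerate(records):
        pieces[i % n_pieces].append(rec)
    return [p for p in pieces if p]


if __name__ == "__main__":
    demo = [">scaffold_1", "ACGT" * 200, ">scaffold_2", "ACGT" * 50]
    kept = filter_by_length(read_fasta(demo))
    print([header for header, _ in kept])
```

Each group returned by `split_records` could be written to its own file and masked independently, which is the idea behind splitting before the masking step and concatenating afterwards.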