Genomic and transcriptomic data for most of the species in the experiment were sourced from NCBI. The remaining data were unpublished and were provided by collaborators.
fetchSRA.sh | gather data | Gather data from multiple NCBI SRRs (**S**equence **R**ead Archive **R**un accessions) | Text file listing the SRR runs to download | Raw reads in .fastq format | Remove the `--split-files` option on line 22 if the gathered data are not paired-end
...
@@ -21,12 +22,14 @@ validateSRA.sh | gather data | Validates that the correct files have been downl
#### Assembly Filtering
Scaffolds shorter than 500 bp were removed from the assemblies.
filtersubmit.sh | filter | Run filterLen.py with the correct module and input settings to remove small scaffolds from the genome assembly | Genome assembly of interest | Genome assembly excluding scaffolds < 500 bp |
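
The length filter itself can be illustrated with a short sketch. This is not the code in filterLen.py, which is not shown here; the function names `read_fasta` and `filter_by_length` are hypothetical, and the sketch assumes a plain FASTA input and that scaffolds of exactly 500 bp are kept.

```python
import io
from typing import Iterable, Iterator, TextIO, Tuple

Record = Tuple[str, str]  # (header without ">", sequence)


def read_fasta(handle: TextIO) -> Iterator[Record]:
    """Yield (header, sequence) pairs from an open FASTA file handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None and line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_by_length(records: Iterable[Record], min_len: int = 500) -> Iterator[Record]:
    """Keep only scaffolds with at least min_len bases (hypothetical helper)."""
    for header, seq in records:
        if len(seq) >= min_len:
            yield header, seq


# Small in-memory demo: one 4 bp scaffold and one 500 bp scaffold.
demo = io.StringIO(">scaffold_1\nACGT\n>scaffold_2\n" + "A" * 500 + "\n")
kept = [name for name, _ in filter_by_length(read_fasta(demo))]
# kept now holds only "scaffold_2", the single record that reaches 500 bp
```

For a real assembly, the same two functions could be applied to a file opened with `open(path)` instead of the in-memory demo.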
#### Softmasking Genome
The repetitive regions of the genome were identified and softmasked using [RepeatModeler](http://www.repeatmasker.org/RepeatModeler/) and [RepeatMasker](http://www.repeatmasker.org/webrepeatmaskerhelp.html), respectively.
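
To make the term concrete: softmasking converts the bases inside repeat regions to lowercase while leaving the sequence otherwise unchanged. The sketch below only illustrates that idea and is not how RepeatMasker is invoked in this pipeline; the function name `softmask` and the 0-based, half-open interval convention are assumptions made for the example.

```python
from typing import Iterable, Tuple


def softmask(sequence: str, repeat_intervals: Iterable[Tuple[int, int]]) -> str:
    """Lowercase the bases inside each (start, end) interval.

    Intervals are assumed to be 0-based and half-open, i.e. start is
    included and end is excluded. This convention is an assumption of
    the sketch, not something taken from RepeatMasker output.
    """
    bases = list(sequence)
    for start, end in repeat_intervals:
        bases[start:end] = [base.lower() for base in bases[start:end]]
    return "".join(bases)


# Bases at positions 2, 3 and 4 are marked as a repeat.
masked = softmask("ACGTACGTAC", [(2, 5)])
# masked == "ACgtaCGTAC"
```

Because the masked bases are only lowercased rather than replaced with `N`, downstream tools can still read the underlying sequence.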
repeatModeler.sh | mask | Use [RepeatModeler](http://www.repeatmasker.org/RepeatModeler/) to generate a de novo repeat library for a genome | Filtered genome assembly in FASTA format | Repeat library suffixed `consensi.fa.classified` |
...
@@ -36,12 +39,14 @@ splitfasta.sh | mask | Reduce the running time for repeatmasker by splitting the
#### Genome Statistical Assessment
[QUAST](http://quast.bioinf.spbau.ru/manual.html) and [BUSCO](https://busco.ezlab.org/) were used to assess the assembly quality and completeness.
The TSA files from NCBI were prepared for use as evidence by the genome annotation tool through frame selection with [TransDecoder](https://github.com/TransDecoder/TransDecoder/wiki) and clustering with [USearch (v9.0)](https://www.drive5.com/usearch/manual9/).
frameSelect.sh | TSA Prep | Use [TransDecoder](https://github.com/TransDecoder/TransDecoder/wiki) to identify coding regions in the transcript assemblies and translate them into peptide sequences | TSA FASTA file | BED, GFF3, CDS (nt coding sequence) & peptide files representing recovered coding regions |