Update Scripts Descriptions authored by Alex Trouern-Trend
...@@ -14,6 +14,7 @@ Table of Contents
#### Gathering Data
The genomic and transcriptomic data used for the experiment were sourced from NCBI for most of the species. The remaining data were unpublished and came from collaborators.

Name | Step | Purpose | Input | Expected Output | Notes
---- | ---- | ------- | ----- | --------------- | -----
fetchSRA.sh | gather data | Gather data from multiple NCBI SRRs (**S**equence **R**ead Archive **R**un accessions) | Text file listing the SRR runs that you want to download | Raw reads in .fastq format | The option `--split-files` in line 22 must be removed if the gathered data is not paired
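
The note about `--split-files` can be sketched as a small command builder. This is an illustration only, not the contents of fetchSRA.sh: it assumes the script calls the SRA Toolkit's `fastq-dump`, and the function names, the `raw_reads` output directory, and the placeholder accessions are hypothetical.

```python
def read_accessions(text):
    """Return SRR accessions from text with one accession per line.

    Blank lines and lines starting with '#' are skipped.
    """
    accessions = []
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            accessions.append(line)
    return accessions


def build_fetch_command(accession, outdir="raw_reads", paired=True):
    """Build an argument list for one download (assumes fastq-dump is the tool used).

    --split-files writes paired reads to separate files, so it is only
    added when the run is paired.
    """
    cmd = ["fastq-dump", "--outdir", outdir]
    if paired:
        cmd.append("--split-files")
    cmd.append(accession)
    return cmd


if __name__ == "__main__":
    # Placeholder accessions, for illustration only.
    for acc in read_accessions("SRR0000001\n\nSRR0000002\n"):
        print(" ".join(build_fetch_command(acc, paired=True)))
```

Each returned list can be passed to `subprocess.run` once the SRA Toolkit is on the `PATH`; passing `paired=False` drops `--split-files`, which mirrors the manual edit described in the Notes column.
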
...@@ -21,12 +22,14 @@ validateSRA.sh | gather data | Validates that the correct files have been downl
#### Assembly Filtering
Scaffolds shorter than 500 bp were removed from the assemblies.

Name | Step | Purpose | Input | Expected Output | Notes
---- | ---- | ------- | ----- | --------------- | -----
filtersubmit.sh | filter | Run filterLen.py with the correct module and input settings to remove small scaffolds from the genome assembly | Genome assembly of interest | Genome assembly excluding scaffolds < 500 bp |
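
The length filter itself is simple enough to sketch in a few lines. The following is a minimal stand-in, not the actual filterLen.py (whose source is not shown here); the function names are hypothetical, and the only detail taken from the text is the 500 bp cutoff.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_by_length(records, min_len=500):
    """Keep scaffolds of at least min_len bases (drops scaffolds < 500 bp by default)."""
    for header, seq in records:
        if len(seq) >= min_len:
            yield header, seq


def write_fasta(records, handle, width=60):
    """Write records as FASTA, wrapping sequence lines at `width` characters."""
    for header, seq in records:
        handle.write(f">{header}\n")
        for i in range(0, len(seq), width):
            handle.write(seq[i:i + width] + "\n")
```

A typical use would open the assembly with `open(path)`, pass the file handle to `read_fasta`, and stream the filtered records to `write_fasta`, so the whole genome never needs to be held in memory at once.
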
#### Softmasking Genome
The repetitive regions of the genome were identified and softmasked using [RepeatModeler](http://www.repeatmasker.org/RepeatModeler/) and [RepeatMasker](http://www.repeatmasker.org/webrepeatmaskerhelp.html), respectively.

Name | Step | Purpose | Input | Expected Output | Notes
---- | ---- | ------- | ----- | --------------- | -----
repeatModeler.sh | mask | Use [RepeatModeler](http://www.repeatmasker.org/RepeatModeler/) to generate a de novo repeats library for a genome | Filtered genome assembly in fasta format | Repeats library suffixed consensi.fa.classified |
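
For readers unfamiliar with the term, softmasking means writing repeat regions in lowercase while leaving the rest of the sequence untouched (RepeatMasker produces this with its `-xsmall` option; whether the scripts here use that flag is an assumption). The toy sketch below only illustrates the idea on a short string; the function names and the interval format (0-based, half-open) are hypothetical.

```python
def softmask(seq, repeats):
    """Return seq with each (start, end) interval lowercased.

    Intervals are 0-based and half-open, i.e. seq[start:end] is masked.
    """
    chars = list(seq)
    for start, end in repeats:
        chars[start:end] = [c.lower() for c in chars[start:end]]
    return "".join(chars)


def masked_fraction(seq):
    """Fraction of bases that are lowercase (softmasked)."""
    if not seq:
        return 0.0
    return sum(1 for c in seq if c.islower()) / len(seq)


if __name__ == "__main__":
    masked = softmask("ACGTACGTAC", [(2, 5)])
    print(masked)                   # ACgtaCGTAC
    print(masked_fraction(masked))  # 0.3
```

Because the bases are only lowercased rather than replaced with `N`, downstream tools can still align to the masked regions while treating them as repeats.
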
...@@ -36,12 +39,14 @@ splitfasta.sh | mask | Reduce the running time for repeatmasker by splitting the
#### Genome Statistical Assessment
[QUAST](http://quast.bioinf.spbau.ru/manual.html) and [BUSCO](https://busco.ezlab.org/) were used to assess the assembly quality and completeness.

Name | Step | Purpose | Input | Expected Output | Notes
---- | ---- | ------- | ----- | --------------- | -----
wholegenomebusco.sh | assembly stats | Use [BUSCO](https://busco.ezlab.org/) to assess genome completeness | Length-filtered/softmasked genome assembly & appropriate single-copy ortholog dataset | Genome completeness benchmark results including predicted genes, alignment results & statistics |
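
One of the contiguity statistics QUAST reports is N50: the scaffold length at which the scaffolds of that length or longer contain at least half of the assembled bases. As a rough cross-check on QUAST output, it can be computed from a list of scaffold lengths. The sketch below is independent of QUAST and the function name is hypothetical.

```python
def n50(lengths):
    """Return the N50 of a collection of scaffold lengths.

    Scaffolds are visited from longest to shortest; N50 is the length of
    the scaffold at which the running total first reaches half of the
    total assembly size.
    """
    if not lengths:
        raise ValueError("no scaffold lengths given")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


if __name__ == "__main__":
    # 400 + 300 = 700 >= 500 (half of 1000), so N50 is 300.
    print(n50([100, 200, 300, 400]))  # 300
```

Comparing `2 * running` against `total` avoids floating-point division when the total assembly size is odd.
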
#### Preparing TSA Files
The TSA files from NCBI were prepared for use as evidence by the genome annotation tool by frame-selecting with [TransDecoder](https://github.com/TransDecoder/TransDecoder/wiki) and clustering with [USearch (v9.0)](https://www.drive5.com/usearch/manual9/).

Name | Step | Purpose | Input | Expected Output | Notes
---- | ---- | ------- | ----- | --------------- | -----
frameSelect.sh | TSA Prep | Use [TransDecoder](https://github.com/TransDecoder/TransDecoder/wiki) to identify coding regions in the transcript assemblies and translate them into peptide sequences | TSA fasta file | BED, GFF3, CDS (nt coding sequence) & peptide files representing recovered coding regions |
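
To give a feel for what frame selection involves, the sketch below finds the longest open reading frame (an `ATG` followed in frame by a stop codon) across the three forward frames of a transcript. It is a deliberate simplification and not TransDecoder's method: TransDecoder also considers the reverse strand and scores candidates for coding potential. The function name is hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_forward_orf(seq):
    """Return the longest ATG...stop ORF in the three forward frames.

    The returned string includes the stop codon; an empty string means
    no complete ORF was found. The reverse strand is not examined.
    """
    seq = seq.upper()
    best = ""
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first start codon opens a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                orf = seq[start:i + 3]
                if len(orf) > len(best):
                    best = orf
                start = None  # look for the next ORF in this frame
    return best


if __name__ == "__main__":
    # The ORF sits in the third forward frame (offset 2).
    print(longest_forward_orf("CCATGAAATAGCC"))  # ATGAAATAG
```
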
...