Update Scripts Descriptions authored by Alex Trouern-Trend
@@ -13,31 +13,35 @@ Table of Contents
#### Gathering Data
The genomic and transcriptomic data used for the experiment were sourced from NCBI for most of the species. The remaining data were unpublished and sourced from collaborators.
Name | Step | Purpose | Input | Expected Output | Notes
---- | ---- | ------- | ----- | --------------- | -----
fetchSRA.sh | gather data | Gather data from multiple NCBI SRRs (**S**equence **R**ead Archive **R**un accessions) | Text file containing the specific SRR runs that you want to download | Raw reads in .fastq format | The `--split-files` option on line 22 must be removed if the gathered data are not paired
validateSRA.sh | gather data | Validates that the correct files have been downloaded from NCBI | Text file containing the specific SRR runs that you want to download | an output and error file confirming that all runs were downloaded correctly |
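
The validation step can be illustrated with a minimal Python sketch. It is not the contents of validateSRA.sh, which are not shown here. It assumes that reads sit in one directory and follow the `<accession>_1.fastq` / `<accession>_2.fastq` naming that `--split-files` produces for paired runs, or `<accession>.fastq` otherwise. The function names `read_accessions` and `missing_fastq` are hypothetical.

```python
from pathlib import Path


def read_accessions(list_path):
    """Return the accessions in a text file, skipping blank and '#' lines."""
    accessions = []
    with open(list_path) as handle:
        for line in handle:
            line = line.strip()
            if line and not line.startswith("#"):
                accessions.append(line)
    return accessions


def missing_fastq(accessions, fastq_dir, paired=True):
    """Map each accession to its expected FASTQ files that are absent or empty."""
    fastq_dir = Path(fastq_dir)
    missing = {}
    for acc in accessions:
        # Assumed naming convention; adjust if the download step names files differently.
        if paired:
            names = [f"{acc}_1.fastq", f"{acc}_2.fastq"]
        else:
            names = [f"{acc}.fastq"]
        absent = []
        for name in names:
            path = fastq_dir / name
            if not path.is_file() or path.stat().st_size == 0:
                absent.append(name)
        if absent:
            missing[acc] = absent
    return missing
```

An empty dictionary from `missing_fastq` means every listed run has its expected files. Any other result names the accessions that may need to be downloaded again.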
#### Assembly Filtering
Scaffolds shorter than 500 bp were removed from the assemblies.
Name | Step | Purpose | Input | Expected Output | Notes
---- | ---- | ------- | ----- | --------------- | -----
filtersubmit.sh | filter | run filterLen.py with the correct module and input settings to remove small scaffolds from genome assembly | genome assembly of interest | genome assembly excluding scaffolds < 500 bp |
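
As an illustration of the length filter, the sketch below keeps only scaffolds of at least 500 bp. It is an assumption about what such a filter might look like, not the actual filterLen.py. The names `read_fasta` and `filter_by_length` are hypothetical, and the output writes each sequence on a single line.

```python
def read_fasta(handle):
    """Yield (header, sequence) pairs from an open FASTA file."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None and line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_by_length(in_path, out_path, min_len=500):
    """Write records with at least min_len bases; return (kept, dropped) counts."""
    kept = dropped = 0
    with open(in_path) as src, open(out_path, "w") as dst:
        for header, seq in read_fasta(src):
            if len(seq) >= min_len:
                # Sequences are written unwrapped, one line per record.
                dst.write(f">{header}\n{seq}\n")
                kept += 1
            else:
                dropped += 1
    return kept, dropped
```

The returned counts give a quick sanity check on how many scaffolds the threshold removed.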
#### Softmasking Genome
The repetitive regions of the genome were identified and softmasked using [RepeatModeler](http://www.repeatmasker.org/RepeatModeler/) and [RepeatMasker](http://www.repeatmasker.org/webrepeatmaskerhelp.html), respectively.
Name | Step | Purpose | Input | Expected Output | Notes
---- | ---- | ------- | ----- | --------------- | -----
repeatModeler.sh | mask | Use [RepeatModeler](http://www.repeatmasker.org/RepeatModeler/) to generate a de novo repeats library for a genome | filtered genome assembly in fasta format | repeats library suffixed consensi.fa.classified |
concat.sh | mask | Concatenate all of the pieces of the split and softmasked genome | all softmasked (sm) pieces of the genome | Softmasked genome assembly |
repeatmasker.sh | mask | Use [RepeatMasker](http://www.repeatmasker.org/webrepeatmaskerhelp.html) to softmask the regions in the genome recognized as repetitive | a piece of the filtered genome, repeat library generated by [RepeatModeler](http://www.repeatmasker.org/RepeatModeler/) | softmasked piece of genome |
splitfasta.sh | mask | Reduce the running time for repeatmasker by splitting the genome into pieces | filtered genome in fasta format | multiple pieces of filtered genome |
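
The split and concatenate steps around masking can be sketched as follows. This is a hypothetical illustration, not the contents of splitfasta.sh or concat.sh. The function names, the round-robin distribution, and the `<prefix>.<i>.fa` piece naming are all assumptions. Sequences are copied unchanged, so lowercase (softmasked) bases survive the round trip, but the record order in the merged file differs from the input.

```python
import shutil


def read_fasta(handle):
    """Yield (header, sequence) pairs from an open FASTA file."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None and line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def split_fasta(in_path, out_prefix, n_pieces):
    """Distribute records round-robin over n_pieces files; return their paths."""
    if n_pieces < 1:
        raise ValueError("n_pieces must be at least 1")
    paths = [f"{out_prefix}.{i}.fa" for i in range(n_pieces)]
    outs = [open(path, "w") for path in paths]
    try:
        with open(in_path) as src:
            for i, (header, seq) in enumerate(read_fasta(src)):
                outs[i % n_pieces].write(f">{header}\n{seq}\n")
    finally:
        for out in outs:
            out.close()
    return paths


def concat_fasta(piece_paths, out_path):
    """Append each piece, in the order given, to a single output file."""
    with open(out_path, "w") as dst:
        for path in piece_paths:
            with open(path) as src:
                shutil.copyfileobj(src, dst)
```

In a workflow like the one described, each piece would be masked separately between the two calls, and the masked pieces would then be passed to `concat_fasta`.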
#### Genome Statistical Assessment
[QUAST](http://quast.bioinf.spbau.ru/manual.html) and [BUSCO](https://busco.ezlab.org/) were used to assess the assembly quality and completeness.
Name | Step | Purpose | Input | Expected Output | Notes
---- | ---- | ------- | ----- | --------------- | -----
wholegenomebusco.sh | assembly stats | Use [BUSCO](https://busco.ezlab.org/) to assess genome completeness | Length-filtered/softmasked genome assembly & appropriate single-copy ortholog dataset | Genome completeness benchmark results including predicted genes, alignment results & statistics |
#### Preparing TSA Files
TSA (Transcriptome Shotgun Assembly) files from NCBI were prepared as evidence for the genome annotation tool by frame-selecting with [TransDecoder](https://github.com/TransDecoder/TransDecoder/wiki) and clustering with [USearch (v9.0)](https://www.drive5.com/usearch/manual9/).
Name | Step | Purpose | Input | Expected Output | Notes
---- | ---- | ------- | ----- | --------------- | -----
frameSelect.sh | TSA Prep | Uses [TransDecoder](https://github.com/TransDecoder/TransDecoder/wiki) to identify coding regions in the transcript assemblies and translate into peptide sequences | TSA fasta file | BED, GFF3, CDS (nt coding sequence) & peptide files representing recovered coding regions |