Update Scripts Descriptions authored by Alex Trouern-Trend
@@ -4,20 +4,27 @@ Plant Computational Genomics Lab @ UConn
The scripts used to process the supporting genomic and transcriptomic data are available in the appropriate subdirectories of the [processing repository](https://gitlab.com/PlantGenomicsLab/annotationtool/tree/master/process), grouped by processing step. [This page](https://gitlab.com/PlantGenomicsLab/annotationtool/wikis/Scripts-Descriptions) is an atlas of script functions: the step categories are listed in the order they are run, and the scripts within each category are listed alphabetically by name.
Table of Contents
* [Gathering Data](#gathering-data)
* [Assembly Filtering](#assembly-filtering)
* [TSA Prep](#preparing-tsa-files)
* [Softmasking Genome](#softmasking-genome)
* [Genome Statistical Assessment](#genome-statistical-assessment)
* [Preparing TSA Files](#preparing-tsa-files)
#### Gathering Data
The genomic and transcriptomic data used for the experiment were sourced from NCBI.

Name | Step | Purpose | Input | Expected Output | Notes
---- | ---- | ------- | ----- | --------------- | -----
fetchSRA.sh | gather data | Gather data from multiple NCBI SRRs (**S**equence **R**ead Archive **R**un accessions) | Text file listing the SRR runs to download | Raw reads in .fastq format | The option `--split-files` in line 22 must be removed if the gathered data are not paired
validateSRA.sh | gather data | Validate that the correct files were downloaded from NCBI | Text file listing the SRR runs to download | An output file and an error file confirming that all runs were downloaded correctly |
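
The note on `--split-files` can be illustrated with a minimal sketch. It assumes that fetchSRA.sh wraps the SRA Toolkit's `fastq-dump`, which this page does not confirm; the function names, the `raw_reads` output directory, and the accession IDs are hypothetical placeholders. The sketch only builds the command lines and does not download anything.

```python
def parse_accessions(text):
    """Return the accession IDs from the text of an SRR list file.

    Blank lines and lines starting with '#' are skipped (an assumed
    convention, not taken from fetchSRA.sh).
    """
    accessions = []
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            accessions.append(line)
    return accessions


def fastq_dump_command(accession, paired=True, outdir="raw_reads"):
    """Build one fastq-dump invocation as an argument list.

    --split-files writes each mate of a paired run to its own file,
    so it is only added when the run is paired.
    """
    cmd = ["fastq-dump", "--outdir", outdir]
    if paired:
        cmd.append("--split-files")
    cmd.append(accession)
    return cmd


if __name__ == "__main__":
    # Hypothetical accession IDs, used only for illustration.
    listing = "SRR0000001\nSRR0000002\n"
    for acc in parse_accessions(listing):
        print(" ".join(fastq_dump_command(acc, paired=True)))
```

Passing `paired=False` drops the flag, which mirrors removing `--split-files` from the script for single-end data.
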
#### Assembly Filtering
Name | Step | Purpose | Input | Expected Output | Notes
---- | ---- | ------- | ----- | --------------- | -----
filtersubmit.sh | filter | Run filterLen.py with the correct module and input settings to remove small scaffolds from the genome assembly | Genome assembly of interest | Genome assembly excluding scaffolds < 500 bp |
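
The length filter described above can be sketched as follows. This is not the code of filterLen.py, which is not shown on this page; it is a minimal illustration of dropping scaffolds shorter than 500 bp, and the function names and the `min_len` parameter are assumptions.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_by_length(records, min_len=500):
    """Keep records whose sequence is at least min_len bases long."""
    return [(h, s) for h, s in records if len(s) >= min_len]


if __name__ == "__main__":
    fasta = [">scaffold_1", "ACGT", ">scaffold_2", "A" * 300, "C" * 300]
    for header, seq in filter_by_length(read_fasta(fasta)):
        print(header, len(seq))
```

A scaffold of exactly 500 bp is kept, which matches the "excluding scaffolds < 500 bp" wording in the table.
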
#### Softmasking Genome
Name | Step | Purpose | Input | Expected Output | Notes
---- | ---- | ------- | ----- | --------------- | -----
repeatModeler.sh | mask | Use [RepeatModeler](http://www.repeatmasker.org/RepeatModeler/) to generate a de novo repeats library for a genome | Filtered genome assembly in fasta format | Repeats library suffixed consensi.fa.classified |
@@ -25,15 +32,16 @@ concat.sh | mask | Concatenate all of the pieces of the split and softmasked gen
repeatmasker.sh | mask | Use [RepeatMasker](http://www.repeatmasker.org/) to softmask the regions of the genome recognized as repetitive | A piece of the filtered genome and the repeat library generated by [RepeatModeler](http://www.repeatmasker.org/RepeatModeler/) | Softmasked piece of the genome |
splitfasta.sh | mask | Reduce the running time of RepeatMasker by splitting the genome into pieces | Filtered genome in fasta format | Multiple pieces of the filtered genome |
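
The split-then-concatenate pattern used in this step can be sketched as below. This is an assumption-laden illustration, not the logic of splitfasta.sh or concat.sh: the function names are hypothetical, records are plain `(header, sequence)` tuples, and the pieces are contiguous so that joining them restores the original scaffold order.

```python
def split_records(records, n_pieces):
    """Split a list of (header, sequence) records into at most
    n_pieces contiguous chunks of roughly equal record count."""
    if n_pieces < 1:
        raise ValueError("n_pieces must be at least 1")
    # Ceiling division; at least 1 so that an empty input is handled.
    size = max(1, -(-len(records) // n_pieces))
    return [records[i:i + size] for i in range(0, len(records), size)]


def concat_pieces(pieces):
    """Join the pieces back into a single list of records, in order."""
    return [record for piece in pieces for record in piece]


if __name__ == "__main__":
    # Lowercase bases stand for softmasked (repetitive) sequence.
    genome = [("s1", "ACGT"), ("s2", "acgt"), ("s3", "GGcc"), ("s4", "TT"), ("s5", "A")]
    pieces = split_records(genome, 2)
    # Each piece would be masked independently before being joined again.
    assert concat_pieces(pieces) == genome
    print(len(pieces), "pieces")
```

Splitting by record count is the simplest choice; balancing pieces by total sequence length would spread the masking work more evenly when scaffold sizes vary widely.
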
#### Genome Statistical Assessment
Name | Step | Purpose | Input | Expected Output | Notes
---- | ---- | ------- | ----- | --------------- | -----
wholegenomebusco.sh | assembly stats | Use [BUSCO](https://busco.ezlab.org/) to assess genome completeness | Length-filtered/softmasked genome assembly & appropriate single-copy ortholog dataset | Genome completeness benchmark results including predicted genes, alignment results & statistics |
#### Preparing TSA Files
Name | Step | Purpose | Input | Expected Output | Notes
---- | ---- | ------- | ----- | --------------- | -----
frameSelect.sh | TSA Prep | Use [TransDecoder](https://github.com/TransDecoder/TransDecoder/wiki) to identify coding regions in the transcript assemblies and translate them into peptide sequences | TSA fasta file | BED, GFF3, CDS (nt coding sequence) & peptide files representing recovered coding regions |
usearch.sh | TSA Prep | Use [USearch v9.0](https://www.drive5.com/usearch/manual9/) to cluster multiple frame-selected TSAs (**T**ranscriptome **S**hotgun **A**ssemblies) by sequence homology into a consensus transcriptome | A single fasta made of concatenated frame-selected TSAs | Clustered reference transcriptome |