Update Scripts Descriptions (authored by Alex Trouern-Trend)
@@ -4,20 +4,27 @@ Plant Computational Genomics Lab @ UConn
The scripts used to process the supporting genomic and transcriptomic data are available in the appropriate subdirectories of the [processing repository](https://gitlab.com/PlantGenomicsLab/annotationtool/tree/master/process), grouped by processing step. [This page](https://gitlab.com/PlantGenomicsLab/annotationtool/wikis/Scripts-Descriptions) is an atlas of script functions: step categories appear in processing order, and scripts are listed alphabetically by name within each category.
Table of Contents
* [Gathering Data](#gathering-data)
* [Assembly Filtering](#assembly-filtering)
* [TSA Prep](#preparing-tsa-files)
* [Softmasking Genome](#softmasking-genome)
* [Genome Statistical Assessment](#genome-statistical-assessment)
* [Preparing TSA Files](#preparing-tsa-files)
#### Gathering Data
The genomic and transcriptomic data used for the experiment were sourced from NCBI.
Name | Step | Purpose | Input | Expected Output | Notes
---- | ---- | ------- | ----- | --------------- | -----
fetchSRA.sh | gather data | Gather data from multiple NCBI SRRs (**S**equence **R**ead Archive **R**un accessions) | Text file containing the specific SRR runs that you want to download | Raw reads in .fastq format | The option `--split-files` on line 22 must be removed if the gathered data is not paired
validateSRA.sh | gather data | Validates that the correct files have been downloaded from NCBI | Text file containing the specific SRR runs that you want to download | an output and error file confirming that all runs were downloaded correctly |
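
As a rough illustration of the validation step, here is a minimal Python sketch that checks whether the expected FASTQ files exist for each accession in the list. It is not the contents of `validateSRA.sh`; the function names `read_accessions` and `missing_fastq` are hypothetical, and the file-naming pattern (`<accession>_1.fastq` and `<accession>_2.fastq` for paired runs downloaded with `--split-files`, `<accession>.fastq` otherwise) is an assumption about how the reads were written.

```python
from pathlib import Path


def read_accessions(list_path):
    """Read one SRR accession per line, skipping blank lines."""
    with open(list_path) as handle:
        return [line.strip() for line in handle if line.strip()]


def missing_fastq(accessions, fastq_dir, paired=True):
    """Return {accession: [missing or empty file names]} for incomplete downloads."""
    fastq_dir = Path(fastq_dir)
    missing = {}
    for acc in accessions:
        # Assumed naming: paired runs give <acc>_1.fastq and <acc>_2.fastq,
        # unpaired runs give a single <acc>.fastq.
        if paired:
            expected = [f"{acc}_1.fastq", f"{acc}_2.fastq"]
        else:
            expected = [f"{acc}.fastq"]
        absent = []
        for name in expected:
            path = fastq_dir / name
            # A file counts as missing if it does not exist or is empty.
            if not path.is_file() or path.stat().st_size == 0:
                absent.append(name)
        if absent:
            missing[acc] = absent
    return missing
```

An empty dictionary from `missing_fastq` would suggest that every listed run has its expected files; it does not verify read counts or checksums.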
#### Assembly Filtering
Name | Step | Purpose | Input | Expected Output | Notes
---- | ---- | ------- | ----- | --------------- | -----
filtersubmit.sh | filter | run filterLen.py with the correct module and input settings to remove small scaffolds from genome assembly | genome assembly of interest | genome assembly excluding scaffolds < 500 bp |
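
The length filter described above can be sketched in a few lines of Python. This is an illustrative stand-in, not the actual `filterLen.py`: the helper names `read_fasta` and `filter_by_length` are hypothetical, and the 500 bp cutoff is taken from the table row.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)


def filter_by_length(records, min_len=500):
    """Keep records whose sequence is at least min_len bases long."""
    return [(header, seq) for header, seq in records if len(seq) >= min_len]
```

For example, `filter_by_length(read_fasta(open("assembly.fa")))` would keep only scaffolds of 500 bp or longer, where `assembly.fa` is a placeholder file name.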
#### Softmasking Genome
Name | Step | Purpose | Input | Expected Output | Notes
---- | ---- | ------- | ----- | --------------- | -----
repeatModeler.sh | mask | Use [RepeatModeler](http://www.repeatmasker.org/RepeatModeler/) to generate a de novo repeats library for a genome | filtered genome assembly in fasta format | repeats library suffixed consensi.fa.classified |
@@ -25,15 +32,16 @@ concat.sh | mask | Concatenate all of the pieces of the split and softmasked gen
repeatmasker.sh | mask | Use [RepeatMasker](http://www.repeatmasker.org/) to softmask the regions in the genome recognized as repetitive | a piece of the filtered genome, repeat library generated by [RepeatModeler](http://www.repeatmasker.org/RepeatModeler/) | softmasked piece of genome |
splitfasta.sh | mask | Reduce the running time of RepeatMasker by splitting the genome into pieces | filtered genome in fasta format | multiple pieces of filtered genome |
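
The split-then-concatenate pattern used around the masking step can be sketched as follows. This is a hedged illustration rather than the logic of `splitfasta.sh` or `concat.sh`: the function names are hypothetical, records are assumed to be `(header, sequence)` tuples already loaded in memory, and contiguous chunks are used so that concatenation restores the original scaffold order.

```python
def split_records(records, n_pieces):
    """Split a list of records into at most n_pieces contiguous chunks."""
    if n_pieces < 1:
        raise ValueError("n_pieces must be at least 1")
    # Ceiling division gives the chunk size; max() guards against empty input.
    size = max(1, -(-len(records) // n_pieces))
    return [records[i:i + size] for i in range(0, len(records), size)]


def concat_pieces(pieces):
    """Join the chunks back into a single list, preserving order."""
    return [record for piece in pieces for record in piece]
```

Each chunk could then be masked independently, and `concat_pieces` applied to the masked chunks in the same order to rebuild the full genome.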
#### Genome Statistical Assessment
Name | Step | Purpose | Input | Expected Output | Notes
---- | ---- | ------- | ----- | --------------- | -----
wholegenomebusco.sh | assembly stats | Use [BUSCO](https://busco.ezlab.org/) to assess genome completeness | Length-filtered/softmasked genome assembly & appropriate single-copy ortholog dataset | Genome completeness benchmark results including predicted genes, alignment results & statistics |
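
To compare completeness across assemblies, the one-line score in a BUSCO short summary can be pulled into a dictionary. The sketch below assumes the summary contains a line of the form `C:89.0%[S:85.8%,D:3.2%],F:6.9%,M:4.1%,n:3023`; that format may differ between BUSCO versions, and the function name `parse_busco_summary` is hypothetical.

```python
import re

# Assumed layout of the BUSCO score line: complete (single, duplicated),
# fragmented, missing, and the total number of orthologs searched.
SUMMARY_RE = re.compile(
    r"C:(?P<C>[\d.]+)%\[S:(?P<S>[\d.]+)%,D:(?P<D>[\d.]+)%\],"
    r"F:(?P<F>[\d.]+)%,M:(?P<M>[\d.]+)%,n:(?P<n>\d+)"
)


def parse_busco_summary(text):
    """Return the BUSCO percentages and ortholog count, or None if absent."""
    match = SUMMARY_RE.search(text)
    if match is None:
        return None
    scores = {key: float(value) for key, value in match.groupdict().items() if key != "n"}
    scores["n"] = int(match.group("n"))
    return scores
```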
#### Preparing TSA Files
Name | Step | Purpose | Input | Expected Output | Notes
---- | ---- | ------- | ----- | --------------- | -----
frameSelect.sh | TSA Prep | Uses [TransDecoder](https://github.com/TransDecoder/TransDecoder/wiki) to identify coding regions in the transcript assemblies and translate into peptide sequences | TSA fasta file | BED, GFF3, CDS (nt coding sequence) & peptide files representing recovered coding regions |
usearch.sh | TSA Prep | Uses [USearch v9.0](https://www.drive5.com/usearch/manual9/) to cluster multiple frame-selected TSAs (**T**ranscriptome **S**hotgun **A**ssemblies) by sequence homology into a consensus transcriptome | A single fasta made of concatenated frame-selected TSAs | Clustered reference transcriptome |