Finished adding scripts and descriptions authored by Alex Trouern-Trend
@@ -10,6 +10,7 @@ Table of Contents
* [Softmasking Genome](#softmasking-genome)
* [Genome Statistical Assessment](#genome-statistical-assessment)
* [Preparing TSA Files](#preparing-tsa-files)
* [Short Read Alignment](#short-read-alignment)
#### Gathering Data
@@ -43,8 +44,9 @@ splitfasta.sh | mask | Reduce the running time for repeatmasker by splitting the
Name | Step | Purpose | Input | Expected Output | Notes
---- | ---- | ------- | ----- | --------------- | -----
wholegenomebusco.sh | assembly stats | Uses [BUSCO](https://busco.ezlab.org/) to assess genome completeness | Length-filtered/softmasked genome assembly & appropriate single-copy ortholog dataset | Genome completeness benchmark results including predicted genes, alignment results & statistics |
quast.sh | assembly stats | Uses QUAST to assess the quality of genome assemblies | Genome before filtering/softmasking & genome after filtering/softmasking | Assembly statistics for both inputs |
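
The QUAST comparison described in the table can be sketched as a single call that takes both assemblies. This is a minimal dry-run sketch, not the contents of `quast.sh`: the file names and output directory are hypothetical, and the commands are only printed unless `DRY_RUN=0` is set.

```shell
#!/bin/sh
# Dry-run sketch of the QUAST step. All paths below are hypothetical.
set -eu

RAW_GENOME="genome.raw.fasta"        # hypothetical: assembly before filtering/softmasking
MASKED_GENOME="genome.masked.fasta"  # hypothetical: assembly after filtering/softmasking
OUT_DIR="quast_out"                  # hypothetical output directory

# Print the command by default; execute it only when DRY_RUN=0.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

assembly_stats() {
  # QUAST accepts several assemblies and reports statistics for each.
  run quast.py "$RAW_GENOME" "$MASKED_GENOME" -o "$OUT_DIR"
}

assembly_stats
```

Passing both assemblies in one call puts their statistics side by side in the same report, which makes the effect of filtering easy to read.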
#### Preparing TSA Files
#### Preparing TSA
TSA files from NCBI were prepared as evidence for the genome annotation tool by frame-selecting with [TransDecoder](https://github.com/TransDecoder/TransDecoder/wiki) and clustering with [USearch (v9.0)](https://www.drive5.com/usearch/manual9/).
Name | Step | Purpose | Input | Expected Output | Notes
@@ -52,6 +54,17 @@ Name | Step | Purpose | Input | Expected Output | Notes
frameSelect.sh | TSA Prep | Uses [TransDecoder](https://github.com/TransDecoder/TransDecoder/wiki) to identify coding regions in the transcript assemblies and translate them into peptide sequences | TSA fasta file | BED, GFF3, CDS (nt coding sequence) & peptide files representing recovered coding regions |
usearch.sh | TSA Prep | Uses [USearch v9.0](https://www.drive5.com/usearch/manual9/) to cluster multiple frame-selected TSAs (**T**ranscriptome **S**hotgun **A**ssembly) by sequence homology into a consensus transcriptome | A single fasta made of concatenated frame-selected TSAs | Clustered reference transcriptome |
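
The two TSA preparation steps can be chained roughly as follows. This is a hedged sketch rather than the contents of `frameSelect.sh` or `usearch.sh`: the TSA file names, the choice to concatenate the CDS outputs, the 0.9 identity threshold and the use of `-centroids` are all assumptions made for illustration. Commands are printed, not executed, unless `DRY_RUN=0` is set.

```shell
#!/bin/sh
# Dry-run sketch of TSA preparation. File names and thresholds are hypothetical.
set -eu

TSA_FILES="tsa1.fasta tsa2.fasta"          # hypothetical TSA downloads from NCBI
COMBINED="combined_tsa.cds.fasta"          # hypothetical concatenated input for clustering
CLUSTERED="clustered_transcriptome.fasta"  # hypothetical clustered output
IDENTITY="0.9"                             # assumed identity threshold, not from the source

# Print the command by default; execute it only when DRY_RUN=0.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

tsa_prep() {
  cds_files=""
  for tsa in $TSA_FILES; do
    # Frame selection: find long ORFs, then predict the likely coding regions.
    run TransDecoder.LongOrfs -t "$tsa"
    run TransDecoder.Predict -t "$tsa"
    # TransDecoder names its outputs after the input file.
    cds_files="${cds_files:+$cds_files }${tsa}.transdecoder.cds"
  done
  # Concatenate the frame-selected TSAs into a single fasta.
  run sh -c "cat $cds_files > $COMBINED"
  # Cluster by sequence identity and keep one representative per cluster.
  run usearch -cluster_fast "$COMBINED" -id "$IDENTITY" -centroids "$CLUSTERED"
}

tsa_prep
```

Whether the CDS or the peptide files are concatenated depends on what the downstream annotation tool expects, so that choice should be checked against the actual scripts.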
#### Evidence Alignment
Short-read and TSA evidence were aligned to genome assemblies using [HISAT2](https://ccb.jhu.edu/software/hisat2/manual.shtml) and [GMAP](http://research-pub.gene.com/gmap/src/README), respectively. Before alignment, short-read evidence was trimmed with Sickle and quality-checked with [FastQC](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/).
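
The trimming and QC pass over each library can be sketched as below. This is an illustrative dry run, not the contents of `sickle.sh` or `fastqc.sh`: it assumes paired-end libraries, Sanger-scaled quality values (`-t sanger`), hypothetical library names, and that FastQC is run on the trimmed reads. Commands are printed, not executed, unless `DRY_RUN=0` is set.

```shell
#!/bin/sh
# Dry-run sketch of short-read trimming and QC. Library names are hypothetical.
set -eu

LIBRARIES="libA libB"   # hypothetical paired-end short-read libraries
QC_DIR="fastqc_out"     # hypothetical FastQC output directory

# Print the command by default; execute it only when DRY_RUN=0.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

qc_pipeline() {
  run mkdir -p "$QC_DIR"
  for lib in $LIBRARIES; do
    # Trim both mates together; reads that lose their mate go to the singles file.
    run sickle pe -t sanger \
      -f "${lib}_1.fastq" -r "${lib}_2.fastq" \
      -o "${lib}_trimmed_1.fastq" -p "${lib}_trimmed_2.fastq" \
      -s "${lib}_trimmed_singles.fastq"
    # Produce HTML quality reports for the trimmed reads.
    run fastqc -o "$QC_DIR" "${lib}_trimmed_1.fastq" "${lib}_trimmed_2.fastq"
  done
}

qc_pipeline
```

Running FastQC on the raw reads as well gives a before/after comparison if trimming needs to be tuned.
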
Name | Step | Purpose | Input | Expected Output | Notes
---- | ---- | ------- | ----- | --------------- | -----
fastqc.sh | short-read QC | Uses FastQC to assess read quality | Fastq files from short-read libraries | Read-quality statistics in HTML |
sickle.sh | short-read QC | Uses Sickle to trim barcode & adapter sequences and remove low-quality reads | Raw fastq files for short-read libraries | Trimmed fastq files |
hisatBuild.sh | short-read align | Builds indices to be used by HISAT2 | Length-filtered and softmasked genome in fasta format | Set of index files |
hisat.sh | short-read align | Runs the HISAT2 short-read aligner | Path to the directory containing the index built by hisatBuild.sh & path to the trimmed reads | Read alignments in SAM format |
convert.sh | short-read align | Uses [samtools](http://samtools.sourceforge.net/) to convert SAM files to the binary BAM format | SAM output from running hisat.sh | BAM files of short-read alignments |
sort.sh | short-read align | Uses samtools to sort BAM files, a prerequisite for merging | Unsorted BAM files | Sorted BAM files |
merge.sh | short-read align | Merges sorted alignments from each short-read library into a single BAM file | Sorted BAM files from each alignment | A single, merged BAM file |
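
The alignment rows of the table chain together as index, align, convert, sort, merge. The sketch below shows that order end to end. It is not the contents of the scripts listed above: the genome, index prefix, library names, thread count and paired-end layout are hypothetical, and commands are printed, not executed, unless `DRY_RUN=0` is set.

```shell
#!/bin/sh
# Dry-run sketch of the short-read alignment steps. All names are hypothetical.
set -eu

GENOME="genome.masked.fasta"  # hypothetical: length-filtered, softmasked assembly
INDEX="genome_index"          # hypothetical HISAT2 index prefix
THREADS=8                     # hypothetical thread count
LIBRARIES="libA libB"         # hypothetical trimmed paired-end libraries

# Print the command by default; execute it only when DRY_RUN=0.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

align_pipeline() {
  # hisatBuild.sh step: build the index once per genome.
  run hisat2-build "$GENOME" "$INDEX"
  sorted=""
  for lib in $LIBRARIES; do
    # hisat.sh step: align one library, writing SAM.
    run hisat2 -p "$THREADS" -x "$INDEX" \
      -1 "${lib}_trimmed_1.fastq" -2 "${lib}_trimmed_2.fastq" -S "${lib}.sam"
    # convert.sh step: SAM to BAM.
    run samtools view -b -o "${lib}.bam" "${lib}.sam"
    # sort.sh step: coordinate-sort, required before merging.
    run samtools sort -o "${lib}.sorted.bam" "${lib}.bam"
    sorted="$sorted ${lib}.sorted.bam"
  done
  # merge.sh step: $sorted is unquoted on purpose so each BAM is its own argument.
  run samtools merge merged.bam $sorted
}

align_pipeline
```

Keeping conversion and sorting as separate steps mirrors the table; in practice the SAM output can also be piped straight into `samtools sort` to avoid the intermediate files.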