... | ... | @@ -44,7 +44,7 @@ The gene predictor [GlimmerHMM](#glimmerhmm) is trained on the different genome |
|
|
|
|
|
#### Prediction with SNAP using MAKER
|
|
|
|
|
|
To be trained, the gene predictor [SNAP](#snap) needs a GFF file generated by [MAKER](#maker). It is generated using the assembly to annotate, the stranded transcriptome assembly, and protein sequences (Viridiplantae and eudicotyledone protein sequences). They are given to MAKER and an initial alignment is performed by [BLAST](#blast) and [Exonerate](#exonerate). Then, an _ab initio_ gene prediction is done with MAKER. At this step, a GFF file is generated. An HMM file is generated by using this GFF file to train SNAP. A second MAKER run is performed with enabled SNAP gene prediction and with the SNAP HMM file given in input.
|
|
|
To be trained, the gene predictor SNAP needs a GFF file generated by [MAKER](#maker). This file is produced from the assembly to annotate, the stranded transcriptome assembly, and protein sequences (Viridiplantae and eudicotyledon protein sequences). MAKER receives these inputs and first aligns them with BLAST and [Exonerate](#exonerate), then performs an _ab initio_ gene prediction, which produces a GFF file. This GFF file is used to train SNAP and generate an HMM file. A second MAKER run is then performed with SNAP gene prediction enabled and the SNAP HMM file given as input.
|
|
|
|
|
|
The HMM file generation and the SNAP prediction with MAKER, using the new HMM file, are then repeated.
|
|
|
|
... | ... | @@ -364,17 +364,54 @@ glimmerhmm -n 1 -g -o <out.gff> <file.fasta> path/to/training/dir/ |
|
|
|
|
|
### MAKER
|
|
|
|
|
|
**Publications**: [MAKER: an easy-to-use annotation pipeline designed for emerging model organism genomes](https://pubmed.ncbi.nlm.nih.gov/18025269/) and [MAKER2: an annotation pipeline and genome-database management tool for second-generation genome projects](https://pubmed.ncbi.nlm.nih.gov/22192575/)
|
|
|
**MAKER publications**: [MAKER: an easy-to-use annotation pipeline designed for emerging model organism genomes](https://pubmed.ncbi.nlm.nih.gov/18025269/) and [MAKER2: an annotation pipeline and genome-database management tool for second-generation genome projects](https://pubmed.ncbi.nlm.nih.gov/22192575/)
|
|
|
|
|
|
**Source code**: [http://www.yandell-lab.org/software/maker.html](http://www.yandell-lab.org/software/maker.html)
|
|
|
**MAKER source code**: [http://www.yandell-lab.org/software/maker.html](http://www.yandell-lab.org/software/maker.html)
|
|
|
|
|
|
**SNAP source code**: [https://github.com/KorfLab/SNAP](https://github.com/KorfLab/SNAP)
|
|
|
|
|
|
MAKER is an annotation pipeline, not a gene predictor. Rather than predicting genes itself, MAKER leverages existing software tools (some of which are gene predictors) and integrates their output to produce what it finds to be the best possible gene model for a given location based on evidence alignments.
|
|
|
|
|
|
MAKER uses 3 different control files that can be generated with the command `maker -CTL`: _maker_opts.ctl_, _maker_exe.ctl_ and _maker_evm.ctl_. The main configuration file is the _maker_opts.ctl_, where we can set the location of the genome, transcript (EST) and protein input files. Other option will be defined like:
|
|
|
#### Configuration files
|
|
|
|
|
|
MAKER uses 3 different control files that can be generated with the command `maker -CTL`: _maker_opts.ctl_, _maker_exe.ctl_ and _maker_evm.ctl_. The main configuration file is _maker_opts.ctl_, where the locations of the genome, transcript (EST) and protein input files are set. Other options can also be defined there, such as:
|
|
|
|
|
|
- _max\_dna\_len=300000_: length for dividing up contigs into chunks (increases/decreases memory usage)
- _split\_hit=20000_: length for the splitting of hits (expected max intron size for evidence alignments)
|
|
|
|
|
|
Depending on the analysis, different options can be changed. To predict annotations directly from transcript (EST) and protein evidence, the options _est2genome_ and _protein2genome_ are set to 1. To enable SNAP prediction, the path to an HMM file must be given to the option _snaphmm_.
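
As a minimal sketch, these options could be set from the shell rather than by hand. The helper `set_maker_opt` is a hypothetical name and is not part of MAKER. It assumes each option sits at the start of a line as `key=value`, optionally followed by a space and a `#` comment, and the file name `round1.hmm` is only a placeholder.

```bash
#!/usr/bin/env bash
# set_maker_opt KEY VALUE [FILE]: hypothetical helper, not shipped with MAKER.
# Rewrites the value of KEY in a control file and keeps any trailing comment.
set_maker_opt() {
    local key="$1" value="$2" file="${3:-maker_opts.ctl}"
    local tmp
    tmp="$(mktemp)" || return 1
    # Replace everything between "KEY=" and the first space or "#".
    if sed "s|^${key}=[^ #]*|${key}=${value}|" "$file" > "$tmp"; then
        # Write back in place so the original file permissions are kept.
        cat "$tmp" > "$file"
    fi
    rm -f "$tmp"
}

# Example for a first, evidence-based run (assumes maker_opts.ctl exists):
#   set_maker_opt est2genome 1
#   set_maker_opt protein2genome 1
# Example for a later run with SNAP enabled (placeholder HMM file name):
#   set_maker_opt snaphmm round1.hmm
```

Values containing `|` or `&` would need escaping before being passed to `sed`; plain file names and integers are fine.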
|
|
|
|
|
|
#### Run MAKER
|
|
|
|
|
|
MAKER can then be run with the following command. It reads the configuration files located in the current directory and generates multiple GFF files as output:
|
|
|
|
|
|
```bash
maker -c 24 -base prefix_name
```
|
|
|
|
|
|
**Arguments**:
|
|
|
|
|
|
- _-c_: number of threads
- _-base_: output files prefix
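
MAKER writes its results under a `<base>.maker.output` directory, with one GFF file per contig. As a sketch, and assuming the `gff3_merge` script distributed with MAKER is on the `PATH`, the per-contig files could be combined into a single GFF through the master datastore index. The function name `merge_maker_gff` and the output file name are illustrative choices, not part of MAKER.

```bash
#!/usr/bin/env bash
# merge_maker_gff BASE: hypothetical wrapper around MAKER's gff3_merge script.
# BASE is the value that was given to "maker -base".
merge_maker_gff() {
    local base="$1"
    local index="${base}.maker.output/${base}_master_datastore_index.log"
    if [ ! -f "$index" ]; then
        echo "datastore index not found: $index" >&2
        return 1
    fi
    # -d: datastore index listing every contig
    # -o: merged output file (this name is our own choice)
    gff3_merge -d "$index" -o "${base}.all.gff"
}

# Example, after "maker -c 24 -base prefix_name" has finished:
#   merge_maker_gff prefix_name
```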
|
|
|
|
|
|
#### Train SNAP
|
|
|
|
|
|
Before predicting annotations with SNAP, the tool must be trained with a GFF file containing annotations. In this pipeline, these annotations were predicted with a first MAKER run using evidence data.
|
|
|
|
|
|
First, the sequences are split into fragments with one gene per sequence (with up to 1000 bp of flanking sequence on either side of each gene). Then, the unique genes (_uni_) are converted to the plus strand and the parameter estimation program (`forge`) is run:
|
|
|
|
|
|
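```bash
# Split into one gene per sequence, keeping up to 1000 bp on either side
fathom genome.ann genome.dna -categorize 1000
# Export the unique genes, converted to the plus strand
fathom uni.ann uni.dna -export 1000 -plus
# Estimate the model parameters
forge export.ann export.dna
```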
|
|
|
|
|
|
Finally, the HMM file can be generated:
|
|
|
|
|
|
```bash
hmm-assembler.pl organism output_dir > output.hmm
```
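
The training steps above can be chained in a single function, sketched here under a few assumptions. The function name `train_snap`, the `params` directory and the output file name are illustrative. The sketch also assumes that _genome.ann_ and _genome.dna_ are produced from the first-run GFF by the `maker2zff` script shipped with MAKER, a step not described above, and that all tools are on the `PATH`.

```bash
#!/usr/bin/env bash
# train_snap GFF ORGANISM: hypothetical wrapper chaining the SNAP training steps.
# Requires maker2zff (MAKER) and fathom, forge, hmm-assembler.pl (SNAP).
train_snap() {
    local gff="$1" organism="$2"
    # Assumption: maker2zff writes genome.ann and genome.dna in the current directory.
    maker2zff "$gff" || return 1
    # One gene per sequence, with up to 1000 bp on either side.
    fathom genome.ann genome.dna -categorize 1000 || return 1
    # Export the unique genes on the plus strand.
    fathom uni.ann uni.dna -export 1000 -plus || return 1
    # forge writes many parameter files, so run it in its own directory.
    mkdir -p params || return 1
    (cd params && forge ../export.ann ../export.dna) || return 1
    # Assemble the parameter files into a single HMM file.
    hmm-assembler.pl "$organism" params > "${organism}.hmm"
}

# Example (placeholder file and organism names):
#   train_snap round1.all.gff my_species
```

The resulting `my_species.hmm` would then be the file given to _snaphmm_ for the next MAKER run.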
|
|
|
|
|
|
### BLAST
|
|
|
|
|
|
### Augustus
|
... | ... | |
... | ... | |