# Table of Contents

* [Conversion Tools](Conversion Tools)

<h1>Conversion Tools</h1>

<h3>proteinortho.tsv → html</h3>

tool: **proteinortho2html.pl**

```
USAGE: proteinortho2html.pl <myproject.proteinortho> (<fasta1> <fasta2> ...)
The first argument points to the proteinortho output (tsv) file.
Any further (optional) files should be fasta files, used to convert the identifiers to proper gene names/descriptions.
The HTML output is printed to stdout; use '>' to write the html output to a file.
```

---

<br>

<h3>proteinortho.tsv → OrthoXML</h3>

tool: **proteinortho2xml.pl**

```
proteinortho2xml.pl <PROTEINORTHOFILE>
Reads a Proteinortho file (not a proteinortho-graph file!) and produces the OrthoXML format (>stdout).
```

For more information about the OrthoXML format, see [orthoxml.org](http://www.orthoxml.org/xml/Main.html).

---

<br>

<h3>proteinortho.tsv → tree (newick format)</h3>

tool: **proteinortho2tree.pl**

```
usage:   proteinortho2tree.pl [OPTIONS] ORTHOMATRIX(.tsv) >OUTTREE
input:   output from Proteinortho (version >5), e.g. myproject.proteinortho.tsv
output:  corresponding UPGMA tree in newick format
options: -o=[FILE] prints output to the given file rather than STDOUT
```
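
Since the three converters above all read the same tsv file and write to stdout, they can be run back to back. The following is a minimal sketch, assuming the scripts are on your `PATH`; the input name `myproject.proteinortho.tsv` and the three output file names are hypothetical placeholders, and the helper function `convert` exists only for this example.

```shell
#!/bin/sh
# Hypothetical project file; adjust to your own -project name.
tsv="myproject.proteinortho.tsv"
converted=0

# Run one converter if it is on PATH and the input exists.
# Each converter prints to stdout, so '>' writes the result to a file.
convert() {
    tool=$1
    out=$2
    if command -v "$tool" >/dev/null 2>&1 && [ -f "$tsv" ]; then
        "$tool" "$tsv" > "$out" && converted=$((converted + 1))
    else
        echo "skipping $tool (tool or $tsv not found)" >&2
    fi
}

convert proteinortho2html.pl "myproject.html"      # HTML table
convert proteinortho2xml.pl  "myproject.orthoxml"  # OrthoXML
convert proteinortho2tree.pl "myproject.nwk"       # UPGMA tree (newick)

echo "converted $converted file(s)"
```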

---

<br>

<h1>Extract Results</h1>

<h3>protein ids/names or proteinortho.tsv → extract protein-sequences (fasta)</h3>

tool: **proteinortho_grab_proteins.pl**

```
proteinortho_grab_proteins.pl        greps all genes/proteins of a given fasta file

SYNOPSIS

proteinortho_grab_proteins.pl (options) QUERY FASTA1 (FASTA2 ...)

  QUERY   proteinortho.tsv FILE or search STRING or '-' for STDIN:
            a) proteinortho output file (.tsv). This uses the -exact option by default.
            b) string of one identifier, e.g. 'tr|asd3|asd', OR multiple identifiers separated by ',' (-F=)
  FASTA*  fasta file(s) (database)

  (options):
    -tofiles, -t   print everything to files instead of stdout; files are called OrthoGroup**.fasta for a proteinortho.tsv file
    -E             enables regex matching; otherwise the string is escaped (e.g. | -> \|)
    -exact         search patterns are extended with a \b, which indicates the end of a word
    -source, -s    adds the filename (FASTA1,...) to the found gene name
    -F=s           char delimiter for multiple identifiers if QUERY is a string input (default: ',')
```

<details>
<summary>More details and examples (Click to expand)</summary>

```
DESCRIPTION

  This script finds and extracts all given identifiers from a list of fasta files.
  The identifiers can be provided as a simple string 'BDNF1', a regex string 'BDNF*',
  or in the form of a proteinortho output file (myproject.proteinortho.tsv).

  Example:

  # 1. most simple call:

  perl proteinortho_grab_proteins.pl 'BDNF1' *.faa

    STDOUT:
      >BDNF1 Brain derived neurotrophic factor OS=human(...)
      MNNGGPTEMYYQQHMQSAGQPQQPQTVTSGPMSHYPPAQPPLLQPGQPYSHGAPSPYQYG
      >BDNF15 Brain derived neurotrophic factor OS=human(...)
      MAFPLHFSREPAHAIPSMKAPFSRHEVPFGRSPSMAIPNSETHDDVPPPLPPPRHPPCTN

  The second hit BDNF15 is reported since it also contains 'BDNF1' as a substring.
  To prevent this behaviour, use proteinortho_grab_proteins.pl -E 'BDNF1\b'.
  The \b marks the end of a word and -E enables regular expressions.

  Or simply add -exact: perl proteinortho_grab_proteins.pl -exact 'BDNF1' *.faa

  # 2. multiple ids:

  perl proteinortho_grab_proteins.pl 'BDNF1,BDNF2,BDNF3' *.faa

  # 3. more complex regex search:

  perl proteinortho_grab_proteins.pl -E 'B?DNF[0-3]3+' *.faa

  This finds: BDNF13, BDNF23, DNF13, DNF033, ...

  # 4. proteinortho tsv file and write output to files:

  proteinortho_grab_proteins.pl -tofiles myproject.proteinortho.tsv test/*.faa

  This will produce the files: OrthoGroup0.fasta, OrthoGroup1.fasta, OrthoGroup2.fasta, ...
  Each fasta file contains all genes of one orthology group (one line in myproject.proteinortho.tsv).
```

</details>
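
The difference between substring, end-of-word and regex matching described above can be reproduced with plain `grep` on a few header lines. This is only an illustration of the matching behaviour, not the script's own code: the header lines are invented, and because POSIX extended regular expressions have no `\b`, the group `([^[:alnum:]_]|$)` is used here as a stand-in for it.

```shell
#!/bin/sh
# Toy FASTA header lines; the IDs are invented for illustration.
headers=$(printf '%s\n' \
    '>BDNF1 Brain derived neurotrophic factor' \
    '>BDNF15 Brain derived neurotrophic factor' \
    '>BDNF13 x' \
    '>DNF033 y' \
    '>BDNF2 z')

# Plain substring search (no -E, no -exact): every header containing 'BDNF1'.
substring_hits=$(printf '%s\n' "$headers" | grep -c -F 'BDNF1')

# -exact: the pattern must end at a word boundary. "Followed by a non-word
# character or the end of the line" stands in for Perl's \b here.
exact_hits=$(printf '%s\n' "$headers" | grep -c -E 'BDNF1([^[:alnum:]_]|$)')

# -E 'B?DNF[0-3]3+': optional B, then DNF, one digit 0-3, one or more 3s.
regex_hits=$(printf '%s\n' "$headers" | grep -E 'B?DNF[0-3]3+' \
    | cut -d ' ' -f 1 | paste -s -d ' ' -)

echo "substring=$substring_hits exact=$exact_hits regex=$regex_hits"
```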

---

<br>

<h3>protein id/name → history</h3>

tool: **proteinortho_history.pl**

```
proteinortho_history.pl        reports the history of a (or a pair of) gene/protein(s).

SYNOPSIS

proteinortho_history.pl (-project=myproject) QUERY (FASTA1 FASTA2 ...)

  QUERY               A string of a single gene/protein, or 2 separated by a comma or a whitespace (the input is escaped using quotemeta; use -noquotemeta to avoid this)

  -project=MYPROJECT  The project name (as specified in proteinortho with -project) (default: myproject)
  -step=[123]         (optional) If specified, more output is printed (to STDOUT) for the given step:
                        -step=1 : search for the given fasta sequence in the input fasta files
                        -step=2 : search in the *.blast-graph
                        -step=3 : search in the *.proteinortho file (default:nothing)
                        -step=all : prints everything of above to STDOUT
  FASTA*              (optional) input fasta files
  -noquotemeta        (optional) If set, then the query will not be escaped.
  -delim=             (optional) Defines the delimiter character for splitting the query (if you want to search for 2 genes/proteins)

NOTE: if you use the -keep option and you have the project_cache_proteinortho/ directory, this program additionally searches for all blast hits.
```

---

<br>

<h3>two proteinortho-graph files → difference</h3>

tool: **proteinortho_compareProteinorthoGraphs.pl**

```
Usage: proteinortho_compareProteinorthoGraphs.pl FILE_A FILE_B

Compares two Proteinortho-graph files and reports additional and different entries.
  D = different
  O = only here
```

---

<br>

<h3>proteinortho.tsv → singles</h3>

tool: **proteinortho_singletons.pl**

```
proteinortho_singletons.pl FASTA1 FASTA2 FASTAN <PROTEINORTHO_OUTFILE
Reads the Proteinortho outfile and its source fasta files to determine entries which occur only once.
```
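
To make the idea concrete, the following sketch computes such entries on toy data with standard shell tools. It is not the script's implementation. It assumes that a singleton is a fasta entry that appears in no orthology group, and that the tsv has three summary columns followed by one column per species, with `*` for a missing member and `,` between co-orthologs. All file names and IDs are invented.

```shell
#!/bin/sh
# Toy data in a temporary directory; all file names and IDs are made up.
tmp=$(mktemp -d)

# Assumed tsv layout: 3 summary columns, then one column per species.
printf '# Species\tGenes\tAlg.-Conn.\tA.faa\tB.faa\n' >  "$tmp/toy.proteinortho.tsv"
printf '2\t2\t1\tA_1\tB_1\n'                          >> "$tmp/toy.proteinortho.tsv"
printf '2\t3\t0.5\tA_2,A_3\tB_2\n'                    >> "$tmp/toy.proteinortho.tsv"

printf '>A_1 desc\nMKV\n>A_2\nMKL\n>A_3\nMKI\n>A_4 lonely\nMKA\n' > "$tmp/A.faa"
printf '>B_1\nMRV\n>B_2\nMRL\n>B_3\nMRA\n'                        > "$tmp/B.faa"

# IDs that occur in some orthology group (columns 4..NF, split on ',').
awk -F '\t' '!/^#/ {
    for (i = 4; i <= NF; i++) {
        n = split($i, genes, ",")
        for (j = 1; j <= n; j++) if (genes[j] != "*") print genes[j]
    }
}' "$tmp/toy.proteinortho.tsv" | sort -u > "$tmp/grouped.txt"

# All IDs in the fasta files (first word of each header line, without '>').
awk '/^>/ { sub(/^>/, "", $1); print $1 }' "$tmp/A.faa" "$tmp/B.faa" \
    | sort -u > "$tmp/all.txt"

# Singletons: fasta IDs that are in no group (lines unique to all.txt).
singletons=$(comm -23 "$tmp/all.txt" "$tmp/grouped.txt")
echo "$singletons"

rm -rf "$tmp"
```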

---

<br>

<h1>Internal programs</h1>

The following programs are used internally by proteinortho.

```
proteinortho_cleanupblastgraph      : used if -checkblast is set
proteinortho_graphMinusRemovegraph  : a clean up procedure to build the proteinortho-graph
proteinortho_clustering             : the main clustering program (C++)
proteinortho_ffadj_mcs.py           : the synteny program
proteinortho_formatUsearch.pl       : a format conversion tool for usearch
proteinortho_do_mcl.pl              : mcl clustering wrapper
proteinortho_treeBuilderCore        : C++ part of the UPGMA algorithm of proteinortho2tree.pl
```