Table of Contents
- proteinortho.tsv → html (proteinortho2html.pl)
- proteinortho.tsv → OrthoXML (proteinortho2xml.pl)
- proteinortho.tsv → tree (newick format) (proteinortho2tree.pl)
- protein ids/names or proteinortho.tsv → extract protein-sequences (fasta) (proteinortho_grab_proteins.pl)
- protein id/name → history (proteinortho_history.pl)
- two proteinortho-graph files → difference (proteinortho_compareProteinorthoGraphs.pl)
- proteinortho.tsv → singletons (proteinortho_singletons.pl)
- proteinortho-graph/blast-graph → species summary table (proteinortho_summary.pl)
Conversion Tools
proteinortho.tsv → html
tool: proteinortho2html.pl
USAGE: proteinortho2html.pl <myproject.proteinortho> (<fasta1> <fasta2> ...)
The first argument points to the proteinortho output (tsv)-file.
Any further (optional) arguments should be fasta files; they are used to convert the identifiers into proper gene names/descriptions.
The HTML output is printed to stdout; use '>' to write it to a file.
proteinortho.tsv → OrthoXML
tool: proteinortho2xml.pl
proteinortho2xml.pl <PROTEINORTHOFILE>
Reads a Proteinortho file (not a proteinortho-graph file!) and writes the OrthoXML format to stdout.
For more information about the OrthoXML format, see orthoxml.org
proteinortho.tsv → tree (newick format)
tool: proteinortho2tree.pl
usage: proteinortho2tree.pl [OPTIONS] ORTHOMATRIX(.tsv) >OUTTREE
input: output from Proteinortho (version >5) e.g. myproject.proteinortho.tsv
output: corresponding UPGMA tree in newick format
options: -o=[FILE] prints output to the given file rather than STDOUT
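To illustrate what a UPGMA tree in newick format looks like, here is a minimal Python sketch. The `upgma` helper and the distance values are hypothetical; how proteinortho2tree.pl derives distances from the orthology matrix is not described here, so this is not expected to reproduce the tool's tree.

```python
from itertools import combinations

def upgma(names, dist):
    """Cluster `names` with UPGMA and return a Newick string (topology only).

    `dist` maps frozenset({a, b}) -> distance for every pair of names.
    """
    clusters = {name: 1 for name in names}   # newick label -> number of leaves
    d = dict(dist)
    while len(clusters) > 1:
        # closest pair of current clusters (ties: first pair in sorted order)
        a, b = min(combinations(sorted(clusters), 2),
                   key=lambda pair: d[frozenset(pair)])
        merged = f"({a},{b})"
        size_a, size_b = clusters.pop(a), clusters.pop(b)
        for other in clusters:
            # UPGMA: size-weighted average of the two merged clusters' distances
            d[frozenset((merged, other))] = (
                size_a * d[frozenset((a, other))]
                + size_b * d[frozenset((b, other))]
            ) / (size_a + size_b)
        clusters[merged] = size_a + size_b
    return next(iter(clusters)) + ";"

# Hypothetical pairwise distances between four species
distances = {
    frozenset(("C", "E")): 0.6, frozenset(("C", "L")): 0.5,
    frozenset(("C", "M")): 0.5, frozenset(("E", "L")): 0.7,
    frozenset(("E", "M")): 0.7, frozenset(("L", "M")): 0.2,
}
tree = upgma(["C", "E", "L", "M"], distances)
print(tree)
```

Branch lengths are left out to keep the sketch short; only the tree topology is printed.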
Extract Results
protein ids/names or proteinortho.tsv → extract protein-sequences (fasta)
tool: proteinortho_grab_proteins.pl
proteinortho_grab_proteins.pl extracts the given genes/proteins from one or more fasta files
SYNOPSIS
proteinortho_grab_proteins.pl (options) QUERY FASTA1 (FASTA2 ...)
QUERY proteinortho.tsv FILE or search STRING or '-' for STDIN:
a) proteinortho output file (.tsv). This uses by default the -exact option.
b) string of one identifier e.g. 'tr|asd3|asd' OR multiple identifiers separated by ',' (see -F=)
FASTA* fasta file(s) (database)
(options):
-tofiles, -t print everything to files instead of stdout; for a proteinortho.tsv input the files are called OrthoGroup**.fasta
-E enables regex matching; otherwise the string is escaped (e.g. | -> \|)
-exact search patterns are extended with a \b, which indicates the end of a word.
-source, -s adds the filename (FASTA1,...) to the found gene-name
-F=s delimiter character for multiple identifiers if QUERY is a string input (default: ',')
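The matching options above can be sketched in Python as follows. The `build_pattern` and `hits` helpers and the header lines are hypothetical, and the sketch assumes that -exact escapes the query and then appends `\b`; it mirrors the option descriptions, not the script's actual code.

```python
import re

def build_pattern(query, regex=False, exact=False):
    # regex=True corresponds to -E: the query is used as a regex as written;
    # otherwise special characters are escaped (e.g. | -> \|)
    pattern = query if regex else re.escape(query)
    if exact:
        pattern += r"\b"   # -exact: the match must stop at the end of a word
    return re.compile(pattern)

# Hypothetical fasta header lines (without the leading '>')
headers = [
    "BDNF1 Brain derived neurotrophic factor",
    "BDNF15 Brain derived neurotrophic factor",
    "tr|asd3|asd some protein",
]

def hits(query, **options):
    pattern = build_pattern(query, **options)
    return [h.split()[0] for h in headers if pattern.search(h)]

default_hits = hits("BDNF1")                  # substring match
exact_hits = hits("BDNF1", exact=True)        # end-of-word match
escaped_hits = hits("tr|asd3|asd")            # '|' taken literally
regex_hits = hits("tr|asd3|asd", regex=True)  # '|' means "or"
print(default_hits, exact_hits, escaped_hits, regex_hits)
```

The last call shows why escaping is the default: as a regex, `tr|asd3|asd` also matches every header containing "tr", such as "neurotrophic".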
DESCRIPTION
This script finds and extracts all given identifiers in a list of fasta files.
The identifier can be provided as a simple string 'BDNF1', regex string 'BDNF*'
or in form of a proteinortho output file (myproject.proteinortho.tsv).
Example:
# 1. most simple call:
perl proteinortho_grab_proteins.pl 'BDNF1' *.faa
STDOUT:
>BDNF1 Brain derived neurotrophic factor OS=human(...)
MNNGGPTEMYYQQHMQSAGQPQQPQTVTSGPMSHYPPAQPPLLQPGQPYSHGAPSPYQYG
>BDNF15 Brain derived neurotrophic factor OS=human(...)
MAFPLHFSREPAHAIPSMKAPFSRHEVPFGRSPSMAIPNSETHDDVPPPLPPPRHPPCTN
The second hit BDNF15 is reported since it also contains 'BDNF1' as a substring.
To prevent this behaviour, use proteinortho_grab_proteins.pl -E 'BDNF1\b'.
The \b marks the end of a word and -E enables regular expressions.
Or simply add -exact: perl proteinortho_grab_proteins.pl -exact 'BDNF1' *.faa
# 2. multiple ids:
perl proteinortho_grab_proteins.pl 'BDNF1,BDNF2,BDNF3' *.faa
# 3. more complex regex search:
perl proteinortho_grab_proteins.pl -E 'B?DNF[0-3]3+' *.faa
This finds: BDNF13, BDNF23, DNF13, DNF033, ...
# 4. proteinortho tsv file and write output to files:
proteinortho_grab_proteins.pl -tofiles myproject.proteinortho.tsv test/*.faa
This will produce the files: OrthoGroup0.fasta, OrthoGroup1.fasta, OrthoGroup2.fasta, ...
Each fasta file contains all genes of one orthology group (one line in myproject.proteinortho.tsv)
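The mapping from rows of the tsv file to OrthoGroup files can be sketched in Python. This assumes a tsv layout of three leading columns followed by one column per input fasta file, with comma-separated ids per cell and `*` for a species without a member; the sample rows and the `groups_from_tsv` helper are made up for illustration.

```python
import csv
import io

# Hypothetical proteinortho.tsv content (layout is an assumption, see above)
tsv_text = (
    "# Species\tGenes\tAlg.-Conn.\tC.faa\tE.faa\tL.faa\n"
    "3\t3\t1\tC_10\tE_10\tL_10\n"
    "2\t3\t0.5\t*\tE_11,E_12\tL_11\n"
)

def groups_from_tsv(text):
    """Return {output file name: [(species file, gene id), ...]} per row."""
    rows = csv.reader(io.StringIO(text), delimiter="\t")
    species = next(rows)[3:]          # header: one column per fasta file
    groups = {}
    for index, row in enumerate(rows):
        members = []
        for name, cell in zip(species, row[3:]):
            if cell != "*":           # '*' marks a species without a member
                members.extend((name, gene) for gene in cell.split(","))
        groups[f"OrthoGroup{index}.fasta"] = members
    return groups

groups = groups_from_tsv(tsv_text)
for filename, members in groups.items():
    print(filename, members)
```

Each collected id would then be looked up in the corresponding fasta file to write the sequences.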
protein id/name → history
tool: proteinortho_history.pl
SYNOPSIS
proteinortho_history.pl (-project=myproject) QUERY (FASTA1 FASTA2 ...)
QUERY A string of a single gene/protein or 2 separated by a comma or a whitespace (the input is escaped using quotemeta, use -noquotemeta to avoid this)
-project=MYPROJECT The project name (as specified in proteinortho with -project) (default:auto detect in the current directory)
-step=[123] (optional) If specified, more output is printed (to STDOUT) for the given step:
-step=1 : search for the given fasta sequence in the input fasta files
-step=2 : search in the *.blast-graph
-step=3 : search in the *.proteinortho file
-step=all : prints everything of above to STDOUT
FASTA* (optional) input fasta files
-noquotemeta, -E (optional) If set, then the query will not be escaped.
-plain, -p, -notableformat (optional) If -step= is set too, then the tables are not formatted and a plain csv is printed instead.
-delim= (optional) Defines the delimiter character for splitting the query (if you want to search for 2 genes/proteins)
NOTE: if you use the -keep option and you have the project_cache_proteinortho/ directory, this program additionally searches for all blast hits.
two proteinortho-graph files → difference
tool: proteinortho_compareProteinorthoGraphs.pl
Usage: proteinortho_compareProteinorthoGraphs.pl FILE_A FILE_B
Compares two Proteinortho-graph files and reports additional and different entries.
D = different
O = only here
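The D/O classification can be sketched in Python on two small edge tables. The edge values and the `compare_graphs` helper are hypothetical, and which fields the script actually compares is an assumption; here an edge counts as different when its stored values differ.

```python
# Hypothetical edges: (gene A, gene B) -> (e-value, bit score)
graph_a = {
    ("C_1", "E_1"): (1e-50, 200),
    ("C_2", "E_2"): (1e-20, 90),
    ("C_3", "E_3"): (1e-10, 50),
}
graph_b = {
    ("C_1", "E_1"): (1e-50, 200),
    ("C_2", "E_2"): (1e-18, 85),
    ("C_4", "E_4"): (1e-30, 120),
}

def compare_graphs(a, b):
    """Return (flag, where, edge) tuples: D = different, O = only here."""
    report = []
    for edge in sorted(a.keys() | b.keys()):
        if edge in a and edge in b:
            if a[edge] != b[edge]:
                report.append(("D", "both", edge))
        elif edge in a:
            report.append(("O", "A", edge))
        else:
            report.append(("O", "B", edge))
    return report

report = compare_graphs(graph_a, graph_b)
for flag, where, edge in report:
    print(flag, where, *edge)
```

Identical edges (here C_1/E_1) are not reported.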
proteinortho.tsv → singletons
tool: proteinortho_singletons.pl
proteinortho_singletons.pl FASTA1 FASTA2 FASTAN <PROTEINORTHO_OUTFILE
Reads the Proteinortho outfile and its source fasta files to determine entries which occur only once
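The idea can be sketched in Python, assuming that a singleton is a fasta identifier that does not appear in any orthology group of the Proteinortho outfile. The sample data and the `fasta_ids` helper are hypothetical.

```python
# Hypothetical fasta content and the set of ids found in the outfile
fasta_text = ">C_1 some description\nMKV\n>C_2\nMAA\n>C_3 other\nMLL\n"
grouped_ids = {"C_1", "C_3", "E_1"}

def fasta_ids(text):
    """Return the identifier (first word) of every header line."""
    return [line[1:].split()[0]
            for line in text.splitlines()
            if line.startswith(">")]

# ids present in the fasta file but in no orthology group
singletons = [gene for gene in fasta_ids(fasta_text) if gene not in grouped_ids]
print(singletons)
```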
proteinortho-graph/blast-graph → species summary table
Shows how the species are connected in the given proteinortho-graph/blast-graph.
tool: proteinortho_summary.pl
proteinortho_summary.pl produces a summary on species level.
SYNOPSIS
proteinortho_summary.pl (options) GRAPH (GRAPH2)
GRAPH Path to the *.proteinortho-graph or *.blast-graph file generated by proteinortho.
GRAPH2 (optional) If you provide a blast-graph AND a proteinortho-graph, the difference is calculated (GRAPH - GRAPH2)
Note: The *.proteinortho.tsv file does not work here (use the proteinortho-graph file)
OPTIONS
-format,-f enables the table formatting instead of the plain csv output.
Example:
$ proteinortho test/*faa
$ proteinortho_summary.pl myproject.proteinortho-graph
# The adjacency matrix, the number of edges between 2 species
# file C.faa C2.faa E.faa L.faa M.faa
C.faa 0 1 13 18 16
C2.faa 1 0 1 1 1
E.faa 13 1 0 14 15
L.faa 18 1 14 0 42
M.faa 16 1 15 42 0
# file average number of edges
C.faa 9.6
C2.faa 0.8
E.faa 8.6
L.faa 15
M.faa 14.8
# The 2-path matrix, the number of paths between 2 species of length 2
# file C.faa C2.faa E.faa L.faa M.faa
C.faa(0) 750 47 493 855 952
C2.faa(1) 94 4 42 74 73
E.faa(2) 986 84 591 865 797
L.faa(3) 1710 148 1730 2285 499
M.faa(4) 1904 146 1594 998 2246
# file average number of 2-paths
C.faa(0) 1088.8
C2.faa(1) 95.2
E.faa(2) 997
L.faa(3) 1374.2
M.faa(4) 1377.6
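The numbers in the output above can be partly re-derived in Python. In this example, the "average number of edges" equals the row sum of the adjacency matrix divided by the number of species, and the first row of the 2-path matrix equals the first row of the squared adjacency matrix. The sketch only reproduces these checked values; it is not taken from the script's source and makes no claim about the other rows or the 2-path averages.

```python
species = ["C.faa", "C2.faa", "E.faa", "L.faa", "M.faa"]

# Adjacency matrix copied from the example output above
adjacency = [
    [0, 1, 13, 18, 16],
    [1, 0, 1, 1, 1],
    [13, 1, 0, 14, 15],
    [18, 1, 14, 0, 42],
    [16, 1, 15, 42, 0],
]

# average number of edges per species: row sum / number of species
averages = [sum(row) / len(row) for row in adjacency]

def two_paths(matrix, i, j):
    """Number of length-2 paths i -> k -> j, summed over all species k."""
    return sum(matrix[i][k] * matrix[k][j] for k in range(len(matrix)))

two_path_row_c = [two_paths(adjacency, 0, j) for j in range(len(species))]

for name, average in zip(species, averages):
    print(name, round(average, 1))
print("C.faa 2-paths:", two_path_row_c)
```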
Example with two graph files (the difference GRAPH - GRAPH2 is reported):
$ proteinortho test/*faa
$ proteinortho_summary.pl myproject.proteinortho-graph myproject.blast-graph
# The adjacency matrix, the number of edges between 2 species
# file C.faa C2.faa E.faa L.faa M.faa
C.faa 0 0 -3 -2 -5
C2.faa 0 0 0 0 0
E.faa -3 0 0 -4 -8
L.faa -2 0 -4 0 -3
M.faa -5 0 -8 -3 0
# file average number of edges
C.faa -2
C2.faa 0
E.faa -3
L.faa -1.8
M.faa -3.2
# The 2-path matrix, the number of paths between 2 species of length 2
# file C.faa C2.faa E.faa L.faa M.faa
C.faa(0) 38 0 48 27 30
C2.faa(1) 0 0 0 0 0
E.faa(2) 96 0 89 30 27
L.faa(3) 54 0 60 29 42
M.faa(4) 60 0 54 84 98
# file average number of 2-paths
C.faa(0) 49.6
C2.faa(1) 0
E.faa(2) 59.8
L.faa(3) 45.4
M.faa(4) 59.2
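Since the adjacency matrix above is the element-wise difference GRAPH - GRAPH2, the edge counts of the blast-graph can be recovered from it. The Python sketch below assumes that this example and the single-graph example for myproject.proteinortho-graph come from the same run; the variable names are made up.

```python
# Adjacency matrix of myproject.proteinortho-graph (single-graph example)
ortho = [
    [0, 1, 13, 18, 16],
    [1, 0, 1, 1, 1],
    [13, 1, 0, 14, 15],
    [18, 1, 14, 0, 42],
    [16, 1, 15, 42, 0],
]
# Reported difference: proteinortho-graph minus blast-graph
diff = [
    [0, 0, -3, -2, -5],
    [0, 0, 0, 0, 0],
    [-3, 0, 0, -4, -8],
    [-2, 0, -4, 0, -3],
    [-5, 0, -8, -3, 0],
]

# diff = ortho - blast, so blast = ortho - diff (element-wise)
blast = [[o - d for o, d in zip(ortho_row, diff_row)]
         for ortho_row, diff_row in zip(ortho, diff)]

# the averages of the difference are again row sum / number of species
diff_averages = [sum(row) / len(row) for row in diff]

print(blast[0])
print([round(a, 1) for a in diff_averages])
```

The negative entries indicate that the blast-graph has more edges between those species than the proteinortho-graph.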
Internal programs
The following programs are used by proteinortho internally.
proteinortho_cleanupblastgraph : used if -checkblast is set
proteinortho_graphMinusRemovegraph : a clean up procedure to build the proteinortho-graph
proteinortho_clustering : the main clustering program (C++)
proteinortho_ffadj_mcs.py : the synteny program
proteinortho_formatUsearch.pl : a format conversion tool for usearch
proteinortho_do_mcl.pl : mcl clustering wrapper
proteinortho_treeBuilderCore : C++ part of the UPGMA algorithm of proteinortho2tree.pl