Last edited by Paul Klemm Dec 11, 2019

Tools and additional programs

Table of Contents

Conversion Tools

  • proteinortho.tsv → html (proteinortho2html.pl)
  • proteinortho.tsv → OrthoXML (proteinortho2xml.pl)
  • proteinortho.tsv → tree (newick format) (proteinortho2tree.pl)

Extract Results

  • protein ids/names or proteinortho.tsv → extract protein-sequences (fasta) (proteinortho_grab_proteins.pl)
  • protein id/name → history (proteinortho_history.pl)
  • two proteinortho-graph files → difference (proteinortho_compareProteinorthoGraphs.pl)
  • proteinortho.tsv → singletons (proteinortho_singletons.pl)
  • proteinortho-graph/blast-graph → species summary table (proteinortho_summary.pl)

Internal programs



Conversion Tools

proteinortho.tsv → html

tool: proteinortho2html.pl

USAGE: proteinortho2html.pl <myproject.proteinortho> (<fasta1> <fasta2> ...)
The first argument points to the proteinortho output (tsv) file.
Any further (optional) files should be fasta files; they are used to convert the identifiers to proper gene names/descriptions.
The HTML output is printed to stdout; use '>' to write it to a file.
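
The identifier-to-description conversion can be pictured with a minimal Python sketch. It assumes that the identifier is the first whitespace-delimited token of a fasta header and that the remainder of the header is the description. The function name header_descriptions is hypothetical and not part of proteinortho2html.pl.

```python
def header_descriptions(fasta_text):
    """Map each sequence identifier to the rest of its fasta header line."""
    mapping = {}
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            # assumption: the id is the first token, the description is the rest
            ident, _, desc = line[1:].strip().partition(" ")
            mapping[ident] = desc
    return mapping


example = """>BDNF1 Brain derived neurotrophic factor OS=human
MNNGGPTEMYYQQHMQSAGQ
>geneX
MAFPLHFSREPAHA
"""

names = header_descriptions(example)
print(names["BDNF1"])
```

A header without a description (geneX here) maps to an empty string, so the plain identifier would be shown in that case.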



proteinortho.tsv → OrthoXML

tool: proteinortho2xml.pl

proteinortho2xml.pl <PROTEINORTHOFILE>
Reads a Proteinortho file (not a proteinortho-graph file!) and produces the OrthoXML format (>stdout).

For more information about the OrthoXML format, see orthoxml.org




proteinortho.tsv → tree (newick format)

tool: proteinortho2tree.pl

usage:   proteinortho2tree.pl [OPTIONS] ORTHOMATRIX(.tsv) >OUTTREE
input:   output from Proteinortho (version >5) e.g. myproject.proteinortho.tsv 
output:  corresponding UPGMA tree in newick format
options: -o=[FILE]  prints output to the given file rather than STDOUT  
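
For readers unfamiliar with UPGMA, the following Python sketch shows the textbook algorithm on a tiny hand-made distance matrix and prints the result in newick format. How proteinortho2tree.pl derives distances from the orthology matrix is not described here, so the distances, the function name upgma and the leaf names are illustrative assumptions only.

```python
def upgma(names, dist):
    """Return a newick string for the UPGMA tree of the given leaves.

    dist maps a pair (a, b) of leaf names to their distance.
    """
    d = {tuple(sorted(pair)): value for pair, value in dist.items()}
    size = {name: 1 for name in names}      # number of leaves per cluster
    height = {name: 0.0 for name in names}  # node height per cluster

    while len(size) > 1:
        # pick the closest pair of clusters (ties are broken by name)
        a, b = min(d, key=lambda pair: (d[pair], pair))
        h = d[(a, b)] / 2.0
        merged = "(%s:%g,%s:%g)" % (a, h - height[a], b, h - height[b])

        # distance to the merged cluster is the size-weighted average
        updated = {}
        for c in size:
            if c in (a, b):
                continue
            d_ac = d[tuple(sorted((a, c)))]
            d_bc = d[tuple(sorted((b, c)))]
            updated[tuple(sorted((merged, c)))] = (
                size[a] * d_ac + size[b] * d_bc
            ) / (size[a] + size[b])

        # drop every distance that involves a or b, then add the new ones
        d = {pair: v for pair, v in d.items() if a not in pair and b not in pair}
        d.update(updated)
        size[merged] = size.pop(a) + size.pop(b)
        height[merged] = h

    return next(iter(size)) + ";"


# hypothetical distances between three species
tree = upgma(["A", "B", "C"], {("A", "B"): 2, ("A", "C"): 6, ("B", "C"): 6})
print(tree)  # ((A:1,B:1):2,C:3);
```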



Extract Results

protein ids/names or proteinortho.tsv → extract protein-sequences (fasta)

tool: proteinortho_grab_proteins.pl

proteinortho_grab_proteins.pl        greps all genes/proteins of a given fasta file
 
SYNOPSIS
 
proteinortho_grab_proteins.pl (options) QUERY FASTA1 (FASTA2 ...)

	QUERY	proteinortho.tsv FILE or search STRING or '-' for STDIN:
		a)	proteinortho output file (.tsv). This uses the -exact option by default.
		b)	string of one identifier, e.g. 'tr|asd3|asd', OR multiple identifiers separated by ',' (see -F=)
	FASTA*	fasta file(s) (database)

	(options):
		-tofiles, -t  print everything to files instead of stdout; for a proteinortho.tsv file the files are called OrthoGroup**.fasta
		-E            enables regex matching; otherwise the string is escaped (e.g. | -> \|)
		-exact        search patterns are extended with a \b, which marks the end of a word.
		-source, -s   adds the filename (FASTA1,...) to the found gene name
		-F=s          char delimiter for multiple identifiers if QUERY is a string input (default: ',')
More details and examples
DESCRIPTION
 
	This script finds and extracts all given identifiers in a list of fasta files. 
	The identifier can be provided as a simple string 'BDNF1', as a regex string 'BDNF*', 
	or in the form of a proteinortho output file (myproject.proteinortho.tsv).
       
	Example:
 
 	# 1. most simple call:

	perl proteinortho_grab_proteins.pl 'BDNF1' *.faa

		STDOUT:
			>BDNF1 Brain derived neurotrophic factor OS=human(...)
			MNNGGPTEMYYQQHMQSAGQPQQPQTVTSGPMSHYPPAQPPLLQPGQPYSHGAPSPYQYG
			>BDNF15 Brain derived neurotrophic factor OS=human(...)
			MAFPLHFSREPAHAIPSMKAPFSRHEVPFGRSPSMAIPNSETHDDVPPPLPPPRHPPCTN

	    The second hit BDNF15 is reported since it also contains 'BDNF1' as a substring. 
	    To prevent this behaviour, use proteinortho_grab_proteins.pl -E 'BDNF1\b'. 
	    The \b marks the end of a word and -E enables regular expressions.

	    Or simply add -exact: perl proteinortho_grab_proteins.pl -exact 'BDNF1' *.faa

 	# 2. multiple ids:

	perl proteinortho_grab_proteins.pl 'BDNF1,BDNF2,BDNF3' *.faa

 	# 3. more complex regex search:

	perl proteinortho_grab_proteins.pl -E 'B?DNF[0-3]3+' *.faa

		This finds: BDNF13, BDNF23, DNF13, DNF033, ... 

 	# 4. proteinortho tsv file and write output to files:

	proteinortho_grab_proteins.pl -tofiles myproject.proteinortho.tsv test/*.faa

		This will produce the files: OrthoGroup0.fasta, OrthoGroup1.fasta, OrthoGroup2.fasta, ...
		Each fasta file contains all genes of one orthology group (one line in myproject.proteinortho.tsv)
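
The difference between the default (escaped substring), -exact and -E matching can be reproduced with Python's re module, whose \b behaves like Perl's for these inputs. This is only a sketch of the matching rules described above, not the script's code; the function grab and the identifier list are hypothetical.

```python
import re

headers = ["BDNF1", "BDNF15", "BDNF13", "BDNF23", "DNF13", "DNF033", "BDNF4"]


def grab(query, names, regex=False, exact=False):
    """Return the names matched by query under the described options."""
    # without regex (-E) the query is escaped, e.g. | -> \|
    pattern = query if regex else re.escape(query)
    if exact:
        # -exact appends \b, which marks the end of a word
        pattern += r"\b"
    return [name for name in names if re.search(pattern, name)]


print(grab("BDNF1", headers))                      # substring match
print(grab("BDNF1", headers, exact=True))          # only BDNF1
print(grab("B?DNF[0-3]3+", headers, regex=True))   # regex search
```

The first call also reports BDNF15 and BDNF13 because they contain 'BDNF1' as a substring; the second call reports BDNF1 only.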



protein id/name → history

tool: proteinortho_history.pl

 
SYNOPSIS
 
proteinortho_history.pl (-project=myproject) QUERY (FASTA1 FASTA2 ...)

	QUERY	A string of a single gene/protein or 2 separated by a comma or a whitespace (the input is escaped using quotemeta, use -noquotemeta to avoid this)

	-project=MYPROJECT	The project name (as specified in proteinortho with -project) (default:auto detect in the current directory)
	-step=[123] 		(optional) If specified, more output is printed (to STDOUT) for the given step:
		-step=1 : search for the given fasta sequence in the input fasta files
		-step=2 : search in the *.blast-graph
		-step=3 : search in the *.proteinortho file 
		-step=all : prints everything of above to STDOUT
	FASTA*						(optional) input fasta files 
	-noquotemeta, -E			(optional) If set, then the query will not be escaped.
	-plain, -p, -notableformat	(optional) If -step= is set too, then the tables are not formatted and a plain csv is printed instead. 
	-delim= 					(optional) Defines the delimiter character for splitting the query (if you want to search for 2 genes/proteins)

	NOTE: if you use the -keep option and you have the project_cache_proteinortho/ directory, this program additionally searches for all blast hits.



two proteinortho-graph files → difference

tool: proteinortho_compareProteinorthoGraphs.pl

Usage: proteinortho_compareProteinorthoGraphs.pl FILE_A FILE_B

Compares two Proteinortho-graph files and reports additional and different entries.
 D = different
 O = only here



proteinortho.tsv → singletons

tool: proteinortho_singletons.pl

proteinortho_singletons.pl FASTA1 FASTA2 FASTAN <PROTEINORTHO_OUTFILE
Reads the Proteinortho outfile and its source fasta files to determine entries which occur only once
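
A minimal Python sketch of the idea, under stated assumptions: a singleton is taken to be a gene that is present in a fasta file but absent from every row of the proteinortho output; the first three tsv columns hold summary values; each later column lists the comma-separated genes of one species; '*' marks a species without a gene in that group. The function names and the example data are hypothetical.

```python
def grouped_ids(tsv_text):
    """Collect every gene id that appears in some orthology group."""
    seen = set()
    for line in tsv_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and the header line
        for cell in line.split("\t")[3:]:  # assumption: 3 leading summary columns
            if cell != "*":                # assumption: '*' means no gene
                seen.update(cell.split(","))
    return seen


def singletons(fasta_ids, tsv_text):
    """Return, per fasta file, the ids that are in no orthology group."""
    seen = grouped_ids(tsv_text)
    return {name: sorted(set(ids) - seen) for name, ids in fasta_ids.items()}


# made-up example data
tsv = "\n".join([
    "# Species\tGenes\tAlg.-Conn.\tC.faa\tE.faa",
    "2\t3\t0.5\tc1,c2\te1",
    "2\t2\t1\tc3\te2",
])
ids = {"C.faa": ["c1", "c2", "c3", "c4"], "E.faa": ["e1", "e2", "e3"]}

print(singletons(ids, tsv))
```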



proteinortho-graph/blast-graph → species summary table

Shows how the species are connected, given the proteinortho-graph/blast-graph.

tool: proteinortho_summary.pl

proteinortho_summary.pl        produces a summary on species level.
 
SYNOPSIS
 
proteinortho_summary.pl (options) GRAPH (GRAPH2)

	GRAPH	Path to the *.proteinortho-graph or *.blast-graph file generated by proteinortho. 
	GRAPH2	(optional) If you provide a blast-graph AND a proteinortho-graph, the difference is calculated (GRAPH - GRAPH2)

	Note: The *.proteinortho.tsv file does not work here (use the proteinortho-graph file)

	OPTIONS

		-format,-f	enables the table formatting instead of the plain csv output.
More details and examples
$ proteinortho test/*faa
$ proteinortho_summary.pl myproject.proteinortho-graph
# The adjacency matrix, the number of edges between 2 species
# file	C.faa	C2.faa	E.faa	L.faa	M.faa	
C.faa	0	1	13	18	16
C2.faa	1	0	1	1	1
E.faa	13	1	0	14	15
L.faa	18	1	14	0	42
M.faa	16	1	15	42	0

# file	average number of edges
C.faa	9.6	
C2.faa	0.8	
E.faa	8.6	
L.faa	15	
M.faa	14.8	

# The 2-path matrix, the number of paths between 2 species of length 2
# file	C.faa	C2.faa	E.faa	L.faa	M.faa	
C.faa(0)	750	47	493	855	952
C2.faa(1)	94	4	42	74	73
E.faa(2)	986	84	591	865	797
L.faa(3)	1710	148	1730	2285	499
M.faa(4)	1904	146	1594	998	2246

# file	average number of 2-paths
C.faa(0)	1088.8
C2.faa(1)	95.2
E.faa(2)	997
L.faa(3)	1374.2
M.faa(4)	1377.6
More details and examples
$ proteinortho test/*faa
$ proteinortho_summary.pl myproject.proteinortho-graph myproject.blast-graph

# The adjacency matrix, the number of edges between 2 species
# file	C.faa	C2.faa	E.faa	L.faa	M.faa	
C.faa	0	0	-3	-2	-5
C2.faa	0	0	0	0	0
E.faa	-3	0	0	-4	-8
L.faa	-2	0	-4	0	-3
M.faa	-5	0	-8	-3	0

# file	average number of edges
C.faa	-2	
C2.faa	0	
E.faa	-3	
L.faa	-1.8	
M.faa	-3.2	

# The 2-path matrix, the number of paths between 2 species of length 2
# file	C.faa	C2.faa	E.faa	L.faa	M.faa	
C.faa(0)	38	0	48	27	30
C2.faa(1)	0	0	0	0	0
E.faa(2)	96	0	89	30	27
L.faa(3)	54	0	60	29	42
M.faa(4)	60	0	54	84	98

# file	average number of 2-paths
C.faa(0)	49.6
C2.faa(1)	0
E.faa(2)	59.8
L.faa(3)	45.4
M.faa(4)	59.2
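
The numbers of the first example (single proteinortho-graph) can be related to each other with a short Python sketch. Each "average number of edges" equals the row sum of the adjacency matrix divided by the number of species, and the diagonal and the entries above it in the 2-path matrix equal the square of the adjacency matrix. The entries below the diagonal in that listing differ from the plain matrix square, so the sketch illustrates the standard graph relationship and is not a reimplementation of proteinortho_summary.pl.

```python
species = ["C.faa", "C2.faa", "E.faa", "L.faa", "M.faa"]

# adjacency matrix copied from the first example output above
adjacency = [
    [0, 1, 13, 18, 16],
    [1, 0, 1, 1, 1],
    [13, 1, 0, 14, 15],
    [18, 1, 14, 0, 42],
    [16, 1, 15, 42, 0],
]


def row_averages(matrix):
    """Average number of edges per species (row sum / number of species)."""
    return [sum(row) / len(row) for row in matrix]


def square(matrix):
    """Matrix product of the matrix with itself (counts paths of length 2)."""
    n = len(matrix)
    return [
        [sum(matrix[i][k] * matrix[k][j] for k in range(n)) for j in range(n)]
        for i in range(n)
    ]


averages = row_averages(adjacency)
two_paths = square(adjacency)

for name, avg in zip(species, averages):
    print(name, avg)
print(two_paths[0])  # first row: 750, 47, 493, 855, 952
```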



Internal programs

The following programs are used internally by proteinortho.

proteinortho_cleanupblastgraph      : used if -checkblast is set
proteinortho_graphMinusRemovegraph  : a clean up procedure to build the proteinortho-graph
proteinortho_clustering             : the main clustering program (C++)
proteinortho_ffadj_mcs.py           : the synteny program
proteinortho_formatUsearch.pl       : a format conversion tool for usearch
proteinortho_do_mcl.pl              : mcl clustering wrapper
proteinortho_treeBuilderCore        : C++ part of the UPGMA algorithm of proteinortho2tree.pl