Paul Klemm · 098c701b
--- a/Tools-and-additional-programs.md
+++ b/Tools-and-additional-programs.md
+# Table of Contents
+
+*  [Conversion Tools](Conversion Tools)
+
+
+<h1>Conversion Tools</h1>
+<h3>proteinortho.tsv → html</h3>
+
+tool: **proteinortho2html.pl** 
+
+```
+USAGE: proteinortho2html.pl <myproject.proteinortho> (<fasta1> <fasta2> ...)
+The first argument points to the proteinortho output (tsv)-file. 
+Any further (optional) files should be fasta files, used to convert the identifiers to proper gene names/descriptions.
+The HTML output is printed to stdout, use '>' to write the html output to a file.
+```
+
+---
+<br>
+
+<h3>proteinortho.tsv → OrthoXML</h3>
+
+tool: **proteinortho2xml.pl** 
+
+```
+proteinortho2xml.pl <PROTEINORTHOFILE>
+Reads Proteinortho file (not proteinortho-graph file!) and produces the OrthoXML format (>stdout).
+```
+
+For more information about the OrthoXML format, see [orthoxml.org](http://www.orthoxml.org/xml/Main.html).
+
+---
+<br>
+
+<h3>proteinortho.tsv → tree (newick format)</h3>
+
+tool: **proteinortho2tree.pl** 
+
+```
+usage:   proteinortho2tree.pl [OPTIONS] ORTHOMATRIX(.tsv) >OUTTREE
+input:   output from Proteinortho (version >5) e.g. myproject.proteinortho.tsv 
+output:  corresponding UPGMA tree in newick format
+options: -o=[FILE]  prints output to the given file rather than STDOUT  
+```
+
+---
+<br>
+
+
+<h1>Extract results</h1>
+
+<h3>protein ids/names or proteinortho.tsv → extract protein-sequences (fasta)</h3>
+
+tool: **proteinortho_grab_proteins.pl** 
+
+```
+proteinortho_grab_proteins.pl        greps all genes/proteins of a given fasta file
+ 
+SYNOPSIS
+ 
+proteinortho_grab_proteins.pl (options) QUERY FASTA1 (FASTA2 ...)
+
+	QUERY	proteinortho.tsv FILE or search STRING or '-' for STDIN:
+		a)	proteinortho output file (.tsv). This uses by default the -exact option.
+		b)	string of one identifier e.g. 'tr|asd3|asd' OR multiple identifiers separated by ',' (-F=)
+	FASTA*	fasta file(s) (database)
+
+	(options):
+		-tofiles, -t  print everything to files instead of stdout; for a proteinortho.tsv file the files are named OrthoGroup**.fasta
+		-E            enables regex matching; otherwise the string is escaped (e.g. | -> \|)
+		-exact        search patterns are extended with a \b, which marks the end of a word.
+		-source, -s   adds the filename (FASTA1,...) to the found gene-name
+		-F=s          char delimiter for multiple identifiers if QUERY is a string input (default: ',')
+```
+
+<details>
+  <summary>More details and examples (Click to expand)</summary>
+
+```
+DESCRIPTION
+ 
+	This script finds and extracts all given identifiers from a list of fasta files. 
+	The identifier can be provided as a simple string 'BDNF1', regex string 'BDNF*' 
+	or in form of a proteinortho output file (myproject.proteinortho.tsv).
+       
+	Example:
+ 
+ 	# 1. most simple call:
+
+	perl proteinortho_grab_proteins.pl 'BDNF1' *.faa
+
+		STDOUT:
+			>BDNF1 Brain derived neurotrophic factor OS=human(...)
+			MNNGGPTEMYYQQHMQSAGQPQQPQTVTSGPMSHYPPAQPPLLQPGQPYSHGAPSPYQYG
+			>BDNF15 Brain derived neurotrophic factor OS=human(...)
+			MAFPLHFSREPAHAIPSMKAPFSRHEVPFGRSPSMAIPNSETHDDVPPPLPPPRHPPCTN
+
+	    The second hit BDNF15 is reported since it also contains 'BDNF1' as a substring. 
+	    To prevent such a behaviour use proteinortho_grab_proteins.pl -E 'BDNF1\b'. 
+	    The \b marks the end of a word and -E enables regular expressions.
+
+	    Or simply add -exact: perl proteinortho_grab_proteins.pl -exact 'BDNF1' *.faa
+
+ 	# 2. multiple ids:
+
+	perl proteinortho_grab_proteins.pl 'BDNF1,BDNF2,BDNF3' *.faa
+
+ 	# 3. more complex regex search:
+
+	perl proteinortho_grab_proteins.pl -E 'B?DNF[0-3]3+' *.faa
+
+		This finds: BDNF13, BDNF23, DNF13, DNF033, ... 
+
+ 	# 4. proteinortho tsv file and write output to files:
+
+	proteinortho_grab_proteins.pl -tofiles myproject.proteinortho.tsv test/*.faa
+
+		This will produce the files: OrthoGroup0.fasta, OrthoGroup1.fasta, OrthoGroup2.fasta, ...
+		Each fasta file contains all genes of one orthology group (one line in myproject.proteinortho.tsv)
+```
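The difference between the default search, `-exact`, and `-E` can be reproduced outside the tool. The Python sketch below is not the script's implementation; it assumes that the default mode behaves like an escaped substring match (as the `-E` description suggests) and that `-exact` appends `\b` to the pattern. The header list and the helper name `grab` are invented for illustration.

```python
import re

# Hypothetical FASTA header IDs, standing in for the entries of *.faa
headers = ["BDNF1", "BDNF15", "BDNF13", "BDNF23", "DNF13", "DNF033", "tr|asd3|asd"]

def grab(query, ids, regex=False, exact=False):
    """Return the ids matched by query (illustrative helper, not part of proteinortho)."""
    # Assumption: without -E the query is escaped, so '|' and friends match literally
    pattern = query if regex else re.escape(query)
    if exact:
        pattern += r"\b"  # end-of-word anchor, as described for -exact
    return [i for i in ids if re.search(pattern, i)]

print(grab("BDNF1", headers))                     # substring match
print(grab("BDNF1", headers, exact=True))         # end-of-word match
print(grab("B?DNF[0-3]3+", headers, regex=True))  # regex match
print(grab("tr|asd3|asd", headers))               # '|' is escaped, not an alternation
```

The first call also returns `BDNF15` and `BDNF13` because they contain `BDNF1` as a substring; with `exact=True` only `BDNF1` remains.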
+
+</details>
+
+
+---
+<br>
+
+
+<h3>protein id/name → history</h3>
+
+tool: **proteinortho_history.pl** 
+
+```
+proteinortho_history.pl        reports the history of a (or a pair of) gene/protein(s).
+ 
+SYNOPSIS
+ 
+proteinortho_history.pl (-project=myproject) QUERY (FASTA1 FASTA2 ...)
+
+	QUERY	A string of a single gene/protein or 2 separated by a comma or a whitespace (the input is escaped using quotemeta, use -noquotemeta to avoid this)
+
+	-project=MYPROJECT	The project name (as specified in proteinortho with -project) (default:myproject)
+	-step=[123] 		(optional) If specified more output is printed (to STDOUT) for the given step:
+		-step=1 : search for the given fasta sequence in the input fasta files
+		-step=2 : search in the *.blast-graph
+		-step=3 : search in the *.proteinortho file (default:nothing)
+		-step=all : prints everything of above to STDOUT
+	FASTA*				(optional) input fasta files 
+	-noquotemeta 		(optional) If set, then the query will not be escaped.
+	-delim= 		(optional) Defines the delimiter character for splitting the query (if you want to search for 2 genes/proteins)
+
+	NOTE: if you use the -keep option and you have the project_cache_proteinortho/ directory, this program additionally searches for all blast hits.
+```
+
+---
+<br>
+
+
+<h3>two proteinortho-graph files → difference</h3>
+
+tool: **proteinortho_compareProteinorthoGraphs.pl** 
+
+```
+Usage: proteinortho_compareProteinorthoGraphs.pl FILE_A FILE_B
+
+Compares two Proteinortho-graph files and reports additional and different entries.
+ D = different
+ O = only here
+```
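The idea behind the `D` and `O` flags can be sketched in Python. This is an illustration of the comparison only, not the Perl script's logic or output format. It assumes that a proteinortho-graph file is tab-separated, that comment lines start with `#`, and that each data line holds two gene IDs followed by score columns. The toy data and function names are made up.

```python
def read_edges(lines):
    """Map each unordered gene pair to its score columns."""
    edges = {}
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # assumed: '#' lines are headers/comments
        a, b, *scores = line.rstrip("\n").split("\t")
        # Simplification: scores are kept in file order, so a pair listed
        # as (b, a) in the other file would be reported as different.
        edges[frozenset((a, b))] = tuple(scores)
    return edges

def compare(lines_a, lines_b):
    """Return (flag, where, pair) tuples: 'O' = only here, 'D' = different."""
    ea, eb = read_edges(lines_a), read_edges(lines_b)
    report = []
    for key in sorted(ea.keys() | eb.keys(), key=sorted):
        pair = tuple(sorted(key))
        if key not in eb:
            report.append(("O", "A", pair))
        elif key not in ea:
            report.append(("O", "B", pair))
        elif ea[key] != eb[key]:
            report.append(("D", "both", pair))
    return report

# Toy graphs, invented for illustration
graph_a = ["# A.faa\tB.faa\n",
           "geneA1\tgeneB1\t1e-50\t200\t1e-48\t195\n",
           "geneA2\tgeneB2\t1e-20\t90\t1e-21\t92\n"]
graph_b = ["# A.faa\tB.faa\n",
           "geneA1\tgeneB1\t1e-50\t200\t1e-48\t195\n",
           "geneA2\tgeneB2\t1e-10\t60\t1e-11\t61\n",
           "geneA3\tgeneB3\t1e-30\t120\t1e-30\t121\n"]

for entry in compare(graph_a, graph_b):
    print(entry)
```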
+
+---
+<br>
+
+<h3>proteinortho.tsv → singles</h3>
+
+tool: **proteinortho_singletons.pl** 
+
+```
+proteinortho_singletons.pl FASTA1 FASTA2 FASTAN <PROTEINORTHO_OUTFILE
+Reads the Proteinortho outfile and its source fasta files to determine entries which occur only once
+```
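One way to read "occur only once" is: genes that are present in a source fasta file but appear in no orthology group. The Python sketch below follows that reading and is not the script's implementation. It assumes the proteinortho.tsv layout of a `#` header line, three leading summary columns, then one column per fasta file, with `*` for a missing member and `,` between co-orthologs. All data and function names are invented for illustration.

```python
def grouped_ids(tsv_lines):
    """Collect every gene ID that appears in some orthology group."""
    seen = set()
    for line in tsv_lines:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        for cell in fields[3:]:  # assumed: first three columns are summary values
            if cell != "*":      # assumed: '*' marks a species without a member
                seen.update(cell.split(","))
    return seen

def fasta_ids(fasta_lines):
    """Return the first word of every non-empty FASTA header line."""
    return [l[1:].split()[0] for l in fasta_lines
            if l.startswith(">") and l[1:].strip()]

def singletons(fasta_lines, tsv_lines):
    """Genes from the fasta input that are absent from all groups."""
    seen = grouped_ids(tsv_lines)
    return [g for g in fasta_ids(fasta_lines) if g not in seen]

# Toy data, invented for illustration
tsv = ["# Species\tGenes\tAlg.-Conn.\tA.faa\tB.faa\n",
       "2\t2\t1\tgeneA1\tgeneB1\n",
       "2\t3\t0.5\tgeneA2,geneA3\tgeneB2\n"]
fasta_a = [">geneA1 some description\n", "MNNG\n",
           ">geneA2\n", "MAFP\n",
           ">geneA3\n", "MKAP\n",
           ">geneA4 no ortholog\n", "MSHY\n"]

print(singletons(fasta_a, tsv))
```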
+
+---
+<br>
+
+<h1>Internal programs</h1>
+
+The following programs are used by proteinortho internally. 
+
+```
+proteinortho_cleanupblastgraph      : used if -checkblast is set
+proteinortho_graphMinusRemovegraph  : a clean up procedure to build the proteinortho-graph
+proteinortho_clustering             : the main clustering program (C++)
+proteinortho_ffadj_mcs.py           : the synteny program
+proteinortho_formatUsearch.pl       : a format conversion tool for usearch
+proteinortho_do_mcl.pl              : mcl clustering wrapper
+proteinortho_treeBuilderCore        : C++ part of the UPGMA algorithm of proteinortho2tree.pl
+```