I don't understand the proteinortho output
General information regarding the output file format can be found here.
In short:
The "*.proteinortho-graph" file gives you the pairwise orthology relations, that is, which protein is orthologous to which other protein.
The "*.proteinortho.tsv" file gives you groups of orthologs (usually this is what you are interested in).
Each row corresponds to a group of orthologous proteins (each column corresponds to an input species).
You can also open the "*.html" output to explore the proteinortho.tsv, or simply open the tsv in Excel/OpenOffice.
Then it is just a matter of what you are looking for. If you know proteinXYZ in one of the species, you can search for it in the proteinortho.tsv file and find all corresponding orthologs in the other species, if there are any.
You can then extract the sequences of these with proteinortho_grab_proteins.pl and look at a sequence alignment to validate your results.
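The lookup described above can also be scripted. The following is a minimal sketch and not part of Proteinortho. It assumes the usual tab-separated layout of the proteinortho.tsv file: a header line starting with "#", three leading summary columns, then one column per species, with comma-separated protein IDs and "*" for a species without a member. The function name find_orthologs and all IDs in the example are made up for illustration.

```python
import csv
import io


def find_orthologs(tsv_lines, protein_id):
    """Return {species: [protein ids]} for the group containing protein_id.

    Assumed layout: header line starting with '#', three summary columns,
    then one column per input species ('*' marks an empty cell).
    Returns None if the protein is not found in any group.
    """
    reader = csv.reader(tsv_lines, delimiter="\t")
    header = next(reader)
    species = header[3:]
    for row in reader:
        if not row or row[0].startswith("#"):
            continue
        cells = [[] if cell == "*" else cell.split(",") for cell in row[3:]]
        if any(protein_id in ids for ids in cells):
            return dict(zip(species, cells))
    return None


# Made-up example content; a real file comes from a Proteinortho run.
example = io.StringIO(
    "# Species\tGenes\tAlg.-Conn.\tA.faa\tB.faa\tC.faa\n"
    "3\t4\t0.8\tA_1\tB_7,B_9\tC_3\n"
    "2\t2\t1\tA_2\t*\tC_5\n"
)
result = find_orthologs(example, "C_5")
print(result)
```

For a real run, pass an open file handle instead of the io.StringIO object, e.g. open("myproject.proteinortho.tsv").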
Why do we use (adaptive) reciprocal best hit?
The adaptive reciprocal best hit algorithm greatly reduces the number of false positives compared to a simple BLAST analysis.
The definition of the adaptive reciprocal best hit:
x and y are orthologous if x is a good homolog of y (highest bitscore or at least 95% of the maximum) and y is a good homolog of x (...).
More details, also about the clustering, can be found here.
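To make the definition concrete, here is a small sketch of the reciprocal check on a toy bitscore table. It is a simplification under stated assumptions, not Proteinortho's implementation: it ignores species boundaries, E-values, coverage filters and the clustering step, and the function names and bitscores are hypothetical.

```python
def adaptive_best_hits(scores, query, sim=0.95):
    """Targets whose bitscore is at least sim * the best bitscore of query."""
    hits = scores.get(query, {})
    if not hits:
        return set()
    cutoff = sim * max(hits.values())
    return {target for target, bits in hits.items() if bits >= cutoff}


def adaptive_rbh_pairs(scores, sim=0.95):
    """Pairs (x, y) where y is a good hit of x and x is a good hit of y."""
    pairs = set()
    for x in scores:
        for y in adaptive_best_hits(scores, x, sim):
            if x in adaptive_best_hits(scores, y, sim):
                pairs.add(tuple(sorted((x, y))))
    return pairs


# Hypothetical bitscores, stored as scores[query][target].
scores = {
    "a1": {"b1": 200.0, "b2": 185.0},
    "b1": {"a1": 198.0},
    "b2": {"a1": 150.0, "a2": 160.0},
    "a2": {"b2": 158.0},
}
strict = sorted(adaptive_rbh_pairs(scores, sim=0.95))
relaxed = sorted(adaptive_rbh_pairs(scores, sim=0.90))
print(strict)   # [('a1', 'b1'), ('a2', 'b2')]
print(relaxed)  # [('a1', 'b1'), ('a1', 'b2'), ('a2', 'b2')]
```

In this toy table, lowering the threshold from 0.95 to 0.90 admits the additional pair a1/b2, because 185 is within 90% of a1's best score (200) and 150 is within 90% of b2's best score (160).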
What does the similarity option -sim do? What is a good value for the similarity?
The -sim option relaxes the single reciprocal best hit to all adaptive reciprocal best hits.
E.g. for -sim=0.9, all reciprocal hits with at least 90% of the highest bitscore are returned.
This can reduce false negatives, at the cost of possibly increasing the false positives too.
The default threshold -sim=0.95 works for a lot of different datasets:
see the 2017 results on proteinortho in eukaryota, fungi and bacteria at orthology.benchmarkservice.org
Can I align reads to a genome with proteinortho?
No, that is a very different problem. Proteinortho compares proteins with proteins or nucleotide sequences with nucleotide sequences. You may be looking for a tool such as segemehl (see issue 26).
Why are my 2 proteins of interest not orthologous?
Please use the proteinortho_history.pl tool to track down your favorite protein or pair of proteins. This tool will also report why something is not orthologous.
Are my species all well connected with each other?
Please use the proteinortho_summary.pl tool to analyse the overall connectivity of your input species. If you find a poorly connected species, you may need to add more species to your analysis that are more closely related to it.
You can also analyse the difference between two given graph files, e.g. how much the clustering changes the result, or compare different parameters of proteinortho:
proteinortho_summary.pl myproject.proteinortho-graph myproject.blast-graph
proteinortho_summary.pl myproject_sim95.proteinortho-graph myproject_sim50.blast-graph
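To illustrate what comparing two graph files means, the sketch below counts shared and unique edges. It is not how proteinortho_summary.pl works internally. It assumes that comment lines start with "#" and that each data line is tab-separated with the two protein IDs in the first two columns. The function names and the sample lines are hypothetical.

```python
def read_edges(lines):
    """Collect undirected edges from graph-file lines.

    Assumed layout: comment lines start with '#'; data lines are
    tab-separated with the two protein IDs in the first two columns.
    """
    edges = set()
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) >= 2:
            # frozenset makes the edge independent of column order
            edges.add(frozenset(fields[:2]))
    return edges


def compare_graphs(lines_a, lines_b):
    """Count edges shared by both graphs and edges unique to each."""
    a, b = read_edges(lines_a), read_edges(lines_b)
    return {"shared": len(a & b), "only_a": len(a - b), "only_b": len(b - a)}


# Made-up lines standing in for two graph files.
graph_a = [
    "# A.faa\tB.faa\n",
    "A_1\tB_7\t1e-50\t200\t1e-48\t198\n",
    "A_2\tB_9\t1e-20\t90\t1e-21\t92\n",
]
graph_b = [
    "# A.faa\tB.faa\n",
    "B_7\tA_1\t1e-48\t198\t1e-50\t200\n",
]
summary = compare_graphs(graph_a, graph_b)
print(summary)
```

With real files, pass open file handles instead of the lists, e.g. the blast-graph as the first argument and the proteinortho-graph as the second to see how many edges the clustering removed.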
How do I get more output?
Use the -debug=1 or -debug=2 option to get more STDERR output. Be careful, this will probably flood your terminal with information.
Can I get the paralogs too?
Singletons are missing in my proteinortho output (proteins without any connection)
Yes, use the --singletons option to output the singletons too.
How do I extract the sequences of one or all orthogroups?
You can use the proteinortho_grab_proteins.pl tool.
(i) Either provide your *proteinortho.tsv file, in which case the script generates a FASTA file for each orthogroup, e.g.
proteinortho_grab_proteins.pl myproject.proteinortho.tsv test/*faa -tofiles
(ii) Or provide the identifiers of a single group, e.g.
proteinortho_grab_proteins.pl "L_641,L_643,M_640,M_642,M_649" test/*faa >myfile.faa
Proteinortho does not use all provided cores / is very slow
If you use the -keep option, there is a known I/O bottleneck when many small FASTA files are used as input. Consider omitting the -keep option to speed up diamond (see issue 36).