Table of contents
- Orthology Primer
- Orthology Inference
- Contamination-like Issues
- Homology – Shared features due to common ancestry.
- Homologous genes (=homologs) – Genes that share a common ancestry.
- Paralogous genes (=paralogs) – Homologous genes derived by a duplication event.
- Orthologous genes (=orthologs) – Homologous genes derived by a speciation event.
- Inparalogs – Paralogs which were derived through a species-specific gene duplication event.
- Outparalogs – Non-inparalogous paralogs.
- One-to-one orthologs – A set of genes where no OTU is represented by more than one copy per gene.
- Many:many orthologs – A group of genes in which more than one set of orthologs is present.
- Graph-based orthology inference – Methods for obtaining a set of homologs from a set of loci based on similarity.
- Tree-based orthology inference – Methods for obtaining 1:1 orthologs from a set of homologs using phylogenetic trees.
PhyloPyPruner is a Python package for refining the output of a graph-based orthology inference approach such as OrthoMCL, OrthoFinder or HaMStR. Similar to other tree-based orthology inference methods (e.g., PhyloTreePruner, UPhO, Agalma and Phylogenomic Dataset Reconstruction), it uses phylogenetic trees in order to obtain genes that are 1:1 orthologous. In addition to filters and algorithms seen in pre-existing tools, this package provides new methods for differentiating contamination-like sequences from paralogs.
Processes such as lineage splitting, gene duplication and gene extinction affect gene relationships, and various terms have hence been introduced to describe the different ways in which genes can be related. Genes that share a common ancestry are said to be homologous (homologous genes are also known as homologs). Homologs are further divided into subclasses: homologs related by a speciation event are known as orthologous genes (or orthologs), whereas homologs related by a gene duplication event are known as paralogous genes (or paralogs). Furthermore, inparalogs (paralogs derived from a species-specific gene duplication) are distinguished from outparalogs (non-inparalogous paralogs).
Orthology inference is the process of deriving sequences related by speciation (i.e., orthologs), rather than gene duplication (i.e., paralogs), from a set of loci which share common ancestry. Distinguishing between orthologs and paralogs is critical to phylogenetic tree reconstruction, as comparing paralogs in a phylogenetic context results in a tree which reflects the history of gene duplication events rather than speciation events. Current approaches for orthology inference can be grouped into graph-based and tree-based methods.
Note. We seldom know the true history of genes, and therefore the orthologs we infer are hypotheses.
Graph-based 'orthology inference' methods cluster sequences together based on an all-versus-all BLAST followed by filtering by hit-fraction, Markov clustering or both. Software packages that use a graph-based approach include OrthoMCL, OrthoFinder and HaMStR.
For example, in OrthoMCL, putative orthologs are identified by finding the reciprocal best hits (RBH) across different proteomes. Two sequences from two different species are each other's RBH if and only if each is more similar to the other than to any other sequence in the other species. In addition, putative "recent" paralogs are identified by finding sequences within the same species that are reciprocally more similar to each other than to any sequence outside of the species. From this information, graphs are created where nodes represent sequences and edges represent their relationships. The paralogs are used to adjust the weighting of the edges, in order to avoid biases that arise due to the strong similarity between sequences. A Markov clustering algorithm (MCL) is then used to separate the sequences into different orthologous groups. MCL simulates random walks across the nodes in order to assess the transition probabilities between sequences.
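The reciprocal best hit criterion described above can be sketched in a few lines of Python. This is a minimal illustration and not OrthoMCL's implementation: the `OTU@sequence` identifier format, the `scores` dictionary and the function names are all assumptions made for the example, and ties between equally good hits are resolved arbitrarily by keeping the first one seen.

```python
def species_of(seq_id):
    # Hypothetical ID format "OTU@sequence"; the part before "@" is the species.
    return seq_id.split("@", 1)[0]


def reciprocal_best_hits(scores):
    """Return the set of RBH pairs from a {(query, subject): score} mapping.

    Higher scores mean more similar sequences. Hits within the same
    species are ignored, since RBHs are defined across species.
    """
    # best[(query, target species)] = (score, best subject in that species)
    best = {}
    for (query, subject), score in scores.items():
        if species_of(query) == species_of(subject):
            continue
        key = (query, species_of(subject))
        if key not in best or score > best[key][0]:
            best[key] = (score, subject)

    pairs = set()
    for (query, _), (_, subject) in best.items():
        # The hit is reciprocal if the subject's best hit in the
        # query's species is the query itself.
        back = best.get((subject, species_of(query)))
        if back is not None and back[1] == query:
            pairs.add(frozenset((query, subject)))
    return pairs


# Made-up similarity scores between two species, A and B.
scores = {
    ("A@1", "B@1"): 90, ("A@1", "B@2"): 40,
    ("A@2", "B@1"): 60, ("A@2", "B@2"): 30,
    ("B@1", "A@1"): 88, ("B@1", "A@2"): 55,
    ("B@2", "A@1"): 42, ("B@2", "A@2"): 35,
}
rbh_pairs = reciprocal_best_hits(scores)
```

In this toy data set only `A@1` and `B@1` pick each other as best hit, so they form the single RBH pair; `A@2` and `B@2` both point at sequences that prefer another partner.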
Since the output from a graph-based approach often contains paralogs or many:many orthologs (a group where more than one set of orthologs has been recognized), a tree-based approach may be employed to derive strict 1:1 orthologs, which are a requirement for phylogenetic inference. As the name implies, these methods make use of phylogenetic trees in order to refine the output of a graph-based approach. Each homologous group is aligned and a phylogenetic tree is inferred for each alignment. The orthology inference program then uses the information within this tree to identify subsets of 1:1 orthologs, where no species is present more than once per locus.
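The 1:1 requirement itself is easy to state in code: a group qualifies only if no species contributes more than one sequence. The following sketch checks that condition for a list of sequence identifiers. The `OTU@sequence` identifier format and the function names are assumptions made for illustration, not part of any of the tools mentioned here.

```python
from collections import Counter


def otu_of(seq_id):
    # Hypothetical ID format "OTU@sequence"; the part before "@" is the OTU.
    return seq_id.split("@", 1)[0]


def repeated_otus(seq_ids):
    """Return {OTU: copy number} for every OTU present more than once."""
    counts = Counter(otu_of(seq_id) for seq_id in seq_ids)
    return {otu: n for otu, n in counts.items() if n > 1}


def is_one_to_one(seq_ids):
    """True if no OTU is represented by more than one sequence."""
    return not repeated_otus(seq_ids)


# A group with two copies from Homo fails the check; pruning would be needed.
group = ["Homo@g1", "Homo@g2", "Mus@g7", "Danio@g3"]
needs_pruning = not is_one_to_one(group)
```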
Figure 1. An overview of a tree-based orthology inference approach. First, sequences from different species are clustered together based on an all-versus-all BLAST, followed by Markov clustering. Each node in the cluster corresponds to a sequence and each edge corresponds to a similarity score. Homologous groups are then aligned and a phylogenetic tree is inferred from the alignment. From this tree, orthologous groups can be identified and paralogs are pruned away.
Examples of tree-based approaches are Agalma, PhyloTreePruner, UPhO, TreeKo and Phylogenomic Dataset Reconstruction. During tree-based orthology inference, two main processes can be recognized: monophyly masking and paralogy pruning.
During monophyly masking, inparalogs (species-specific gene duplications) and/or isoforms (products derived from alternative splicing) that form a monophyletic group and come from the same species are replaced by a single sequence. The representative sequence is chosen from the group (1) at random, (2) by taking the longest sequence, or (3) by taking the sequence with the shortest pairwise distance to the rest of the sequences within the alignment.
Paralogy pruning is the process of deriving 1:1 orthologs from a set of homologs, recognized by a graph-based orthology inference approach, through the guidance of a phylogenetic tree. Several methods are available for paralogy pruning, and all of them rely on phylogenetic trees. The name paralogy pruning comes from the paralogy trimming step, which is necessary to derive strict 1:1 orthologs from a set of homologs. All the methods that PhyloPyPruner provides are further explained in the methods section of this Wiki.
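As a rough illustration of monophyly masking with the "longest sequence" criterion, the sketch below collapses every clade whose tips all come from the same OTU into its longest member. It makes several simplifying assumptions that are not taken from PhyloPyPruner: the tree is rooted and written as nested tuples, identifiers follow a hypothetical `OTU@sequence` format, and sequence lengths are supplied in a plain dictionary.

```python
def otu_of(seq_id):
    # Hypothetical ID format "OTU@sequence"; the part before "@" is the OTU.
    return seq_id.split("@", 1)[0]


def leaves(tree):
    """List the tip labels of a tree given as nested tuples of strings."""
    if isinstance(tree, str):
        return [tree]
    return [leaf for child in tree for leaf in leaves(child)]


def mask_monophylies(tree, seq_lengths):
    """Replace each same-OTU clade with its longest sequence."""
    if isinstance(tree, str):
        return tree
    tips = leaves(tree)
    if len({otu_of(tip) for tip in tips}) == 1:
        # All tips belong to one OTU: keep only the longest sequence.
        return max(tips, key=lambda tip: seq_lengths[tip])
    return tuple(mask_monophylies(child, seq_lengths) for child in tree)


# Made-up tree and sequence lengths for illustration.
tree = ((("A@1", "A@2"), "B@1"), ("C@1", ("D@1", "D@2")))
seq_lengths = {"A@1": 300, "A@2": 450, "B@1": 400,
               "C@1": 380, "D@1": 500, "D@2": 200}
masked = mask_monophylies(tree, seq_lengths)
```

Here the clades `(A@1, A@2)` and `(D@1, D@2)` are each reduced to their longer sequence, leaving one copy per OTU. Swapping the `key` function is enough to implement a different selection criterion, such as shortest pairwise distance.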
In the following table, we compare the various paralogy pruning methods found across different tree-based orthology inference programs.
Table 1: Comparison chart of paralogy pruning methods in tree-based programs.
| Method | PhyloPyPruner | PhyloTreePruner | Agalma | UPhO | Phylogenomic Dataset Reconstruction |
|---|---|---|---|---|---|
| Largest subtree (LS) | ✓ | ✓ | | | |
| Maximum inclusion (MI) | ✓ | | ✓ | ✓ | ✓ |
| Rooted tree (RT) | ✓ | | | | ✓ |
| Monophyletic outgroups (MO) | ✓ | | | | ✓ |
| 1to1 orthologs (1to1) | ✓ | | | | ✓ |
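To give a feel for what a paralogy pruning method does, here is a simplified sketch in the spirit of the largest subtree (LS) idea: it searches a tree for the biggest clade in which every OTU occurs at most once. This is an assumption-laden illustration rather than the algorithm used by PhyloPyPruner or PhyloTreePruner. In particular, it treats the tree as rooted, represents it as nested tuples, and relies on a hypothetical `OTU@sequence` identifier format.

```python
def otu_of(seq_id):
    # Hypothetical ID format "OTU@sequence"; the part before "@" is the OTU.
    return seq_id.split("@", 1)[0]


def leaves(tree):
    """List the tip labels of a tree given as nested tuples of strings."""
    if isinstance(tree, str):
        return [tree]
    return [leaf for child in tree for leaf in leaves(child)]


def largest_subtree(tree):
    """Return the tips of the largest clade without repeated OTUs."""
    tips = leaves(tree)
    otus = [otu_of(tip) for tip in tips]
    if len(set(otus)) == len(otus):
        # No OTU is repeated, so this whole clade is a 1:1 candidate.
        return tips
    # Otherwise, look for the best candidate among the child clades.
    best = []
    for child in tree:
        candidate = largest_subtree(child)
        if len(candidate) > len(best):
            best = candidate
    return best


# Made-up tree where OTUs A, B and C are duplicated at the root.
tree = (("A@1", ("B@1", "C@1")),
        (("A@2", "B@2"), ("C@2", ("D@1", "E@1"))))
orthologs = largest_subtree(tree)
```

In this example the root contains two copies of A, B and C, so the search descends and keeps the five-tip clade on the right rather than the three-tip clade on the left.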
Definition. Contamination-like sequences are those interpreted as paralogs by earlier tree-based approaches, but, based on their position in a tree relative to other sequences from the same taxon and a lack of evidence for paralogy in other taxa, are most likely the result of exogenous contamination, misalignment, sequencing errors, etc.
Contaminant sequences or misidentified paralogy can cause tree-based approaches to erroneously infer paralogy and unnecessarily exclude many sequences. Sequence contamination can be introduced during the sampling step, due to symbionts, foreign sequences from gut content, or other non-target organisms living in close association with the target organism (e.g., epibionts or parasites). Few bioinformatic tools are available today to help identify and reduce contamination in phylogenomic datasets, and most of them rely on information that is not always available when sequences are derived from, for example, a public database.
This program provides various tools for reducing the number of contamination-like sequences. Cross-contaminants may be identified and removed by setting a minimum threshold for pairwise distance across species. Furthermore, individual OTUs with a negative impact on the results can be identified by our novel paralogy frequency metric (number of paralogs present per species, normalized by the times said species is present) or by a jackknife approach. The jackknife approach provides an extensive statistical report, which shows the impact of removing each OTU, one by one. Finally, the stability of expected monophyletic groups may be assessed by defining subclades. The program then reports how many times each such group is recovered as monophyletic across all trees, after various filters have been applied.
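The paralogy frequency metric can be illustrated with a small calculation. The sketch below is one possible reading of the description above and not PhyloPyPruner's actual implementation. It assumes that every copy from an OTU counts as a paralog whenever that OTU has more than one copy in a homolog group, and that "times said species is present" means the number of groups containing the OTU. The `OTU@sequence` identifier format and the function names are hypothetical as well.

```python
from collections import Counter


def otu_of(seq_id):
    # Hypothetical ID format "OTU@sequence"; the part before "@" is the OTU.
    return seq_id.split("@", 1)[0]


def paralogy_frequency(groups):
    """Return {OTU: paralogs / groups in which the OTU is present}.

    Assumption: all copies of an OTU count as paralogs in any group
    where that OTU is represented by more than one sequence.
    """
    paralogs = Counter()
    presence = Counter()
    for group in groups:
        copies = Counter(otu_of(seq_id) for seq_id in group)
        for otu, n in copies.items():
            presence[otu] += 1
            if n > 1:
                paralogs[otu] += n
    return {otu: paralogs[otu] / presence[otu] for otu in presence}


# Three made-up homolog groups; OTU A is duplicated in two of them.
groups = [
    ["A@1", "A@2", "B@1", "C@1"],
    ["A@3", "B@2", "C@2"],
    ["A@4", "A@5", "B@3", "B@4", "C@3"],
]
frequencies = paralogy_frequency(groups)
```

Under these assumptions OTU A scores highest (four paralogous copies over three groups), B scores lower and C scores zero, which would flag A as the first candidate for closer inspection or removal.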