Table of contents
- Output directory and overview
- Output alignments
- Supermatrix statistics
- Input alignment statistics
- Output alignment statistics
- OTU statistics
- Log file
- Paralogy Frequency Statistics
If no output directory has been set using the --output flag, then the folder that contains the input alignments is used as the output directory. If a non-existent directory has been provided, then that directory and all of its parent directories are created automatically. All CSV files use a comma (,) as a field separator. After running this program, the following files are generated:
<output directory>/
└── phylopypruner_output/
    ├── supermatrix_stats.csv
    ├── input_alignments_stats.csv
    ├── supermatrix.fas
    ├── output_alignments_stats.csv
    ├── otu_stats.csv
    ├── phylopypruner.log
    ├── paralogy_freq_plot.png*
    ├── occupancy_matrix.png*
    └── output_alignments/
        ├── 1_pruned.fas
        ├── 2_pruned.fas
        ├── 3_pruned.fas
        ...
*: Only produced if Matplotlib is installed.
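The directory handling described above can be sketched in a few lines of Python. This is a minimal illustration, not PhyloPyPruner's actual code: the function name `resolve_output_dir` and its parameters are hypothetical, and the only behavior taken from the text is the fallback to the input directory and the automatic creation of missing parent directories.

```python
from pathlib import Path
from typing import Optional


def resolve_output_dir(input_dir: str, output: Optional[str] = None) -> Path:
    """Return the results directory, creating it (and its parents) if needed.

    Hypothetical helper: `output` stands in for the value of --output.
    """
    # Fall back to the folder holding the input alignments when no
    # output directory was given.
    base = Path(output) if output is not None else Path(input_dir)
    out_dir = base / "phylopypruner_output"
    # parents=True creates missing parent directories; exist_ok=True
    # avoids an error when the directory already exists.
    out_dir.mkdir(parents=True, exist_ok=True)
    return out_dir
```

Calling `resolve_output_dir("alignments")` would place the results next to the input alignments, while passing a second argument mirrors the effect of --output.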
All output alignments are stored in a subfolder of the output folder with the path <timestamp>_orthologs. For each alignment where an ortholog was recovered, the corresponding output alignment retains the same name, but with _pruned appended to it. If more than one alignment was generated for a single input alignment, an integer index is also added to the name. Output alignments retain the filename extension of the input alignment, so a file named 16s.fas will have the corresponding output alignment 16s_pruned.fas.
By default, sequence data is kept on a single line. For more readable output, you can wrap sequences at column n by typing --wrap n, where n is a positive integer. For example, wrap sequence data at column 80 by typing --wrap 80. Output alignments use the same name as the input alignments, but with the string _pruned appended to the end. Note that some paralogy pruning algorithms, such as maximum inclusion (MI), may produce multiple orthologs for a single input file; in those cases an index is also added to the end of the name.
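The naming and wrapping rules can be illustrated with a short Python sketch. Both function names are hypothetical, and the way the index is attached (an underscore followed by the integer, after _pruned) is an assumption, since the text only says that an index is added.

```python
from pathlib import Path
from typing import Optional


def pruned_name(input_name: str, index: Optional[int] = None) -> str:
    """Build an output filename: stem + '_pruned' (+ index) + extension."""
    path = Path(input_name)
    # Assumption: the index, when present, is joined with an underscore.
    suffix = "_pruned" if index is None else f"_pruned_{index}"
    # The original filename extension is kept unchanged.
    return f"{path.stem}{suffix}{path.suffix}"


def wrap_sequence(sequence: str, width: Optional[int] = None) -> str:
    """Return the sequence on one line, or wrapped at `width` columns."""
    if width is None:
        return sequence  # default: keep the sequence on a single line
    if width < 1:
        raise ValueError("width must be a positive integer")
    chunks = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join(chunks)
```

For instance, `wrap_sequence(seq, 80)` corresponds to running with --wrap 80, and `pruned_name("16s.fas")` keeps the `.fas` extension of the input file.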
In addition to individual alignments, a supermatrix is also created by concatenating all the individual alignments into a single file called supermatrix.fas (the filename extension depends on your input alignments, so the file might instead be called supermatrix.fasta, for example). The ranges of the individual gene partitions are written to the file gene_partitions.txt. The output is similar to what you would get if you ran the individual output alignments through an alignment concatenator such as FASconCAT. Missing data is denoted by either 'N' or 'X', depending on the type of data (nucleotides or amino acids). The program automatically guesses whether the data consists of nucleotides or amino acids, based on unique characters or on the proportion of A, C, G, and T characters (if these characters make up more than 50% of all individual bases, then the data is assumed to consist of nucleotides).
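The 50% rule for guessing the data type can be sketched as follows. This is an illustration under assumptions rather than the program's implementation: the function names are hypothetical, gap characters are assumed to be excluded from the count, and the check based on unique characters is not modeled.

```python
from typing import Iterable, List

GAP_CHARACTERS = set("-?")  # assumption: gaps are ignored when counting


def looks_like_nucleotides(sequences: Iterable[str], threshold: float = 0.5) -> bool:
    """Return True if A, C, G and T make up more than `threshold` of all characters."""
    total = 0
    acgt = 0
    for sequence in sequences:
        for char in sequence.upper():
            if char in GAP_CHARACTERS:
                continue
            total += 1
            if char in "ACGT":
                acgt += 1
    if total == 0:
        return False  # assumption: empty input is not classified as nucleotides
    return acgt / total > threshold


def missing_data_symbol(sequences: List[str]) -> str:
    """Pick 'N' for nucleotide data and 'X' for amino acid data."""
    return "N" if looks_like_nucleotides(sequences) else "X"
```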
Figure 1. Example of an output supermatrix file, seen in the alignment viewer AliView.
AUTO, 11267 = 1-205
AUTO, 14983 = 206-379
AUTO, 05504 = 380-516
AUTO, 05749 = 517-635
AUTO, 01685 = 636-860
AUTO, 10744 = 861-1030
AUTO, 01482 = 1031-1262
AUTO, 01549 = 1263-1474
AUTO, 05894 = 1475-1703
AUTO, 01770 = 1704-1909
Figure 2. Example of a gene partition file generated by PhyloPyPruner.
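The partition ranges follow directly from the lengths of the concatenated alignments: each gene starts one column after the previous gene ends. The sketch below reproduces the line format shown in Figure 2; the function name `partition_lines` is hypothetical, and the alignment lengths used in the usage note are derived from the ranges in that figure.

```python
from typing import Iterable, List, Tuple


def partition_lines(genes: Iterable[Tuple[str, int]]) -> List[str]:
    """Turn (gene name, alignment length) pairs into partition lines.

    Ranges are 1-based and inclusive, as in the example partition file.
    """
    lines = []
    start = 1
    for name, length in genes:
        end = start + length - 1
        lines.append(f"AUTO, {name} = {start}-{end}")
        start = end + 1  # the next gene begins right after this one
    return lines
```

With the first two genes from Figure 2, `partition_lines([("11267", 205), ("14983", 174)])` yields the ranges 1-205 and 206-379.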
This file contains statistics for all input and output alignments, each set treated as a single concatenated alignment. Supermatrix statistics are stored in the supermatrix_stats.csv file, which uses a comma (',') as a field separator. If jackknifing was performed, the results are included here, but none of the alignments will be saved.
Missing data is calculated by counting the number of gap characters in each sequence (the gap characters recognized by PhyloPyPruner are '-', '?' and 'x') and adding the alignment length multiplied by the number of OTUs that are missing from the alignment. This is done per alignment, and the summed missing data of all alignments is then divided by the total number of alignments. For display purposes, this number is multiplied by 100 and rounded.
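One way to read this calculation is sketched below. It is an interpretation, not the program's code: the function names are hypothetical, each alignment's missing count is assumed to be normalized by the alignment length times the total number of OTUs before averaging, and the gap characters are matched exactly as listed, since the text does not say whether case matters.

```python
from typing import Dict, List, Set

GAP_CHARACTERS = set("-?x")  # as listed in the text; case handling is an assumption


def missing_fraction(alignment: Dict[str, str], all_otus: Set[str]) -> float:
    """Fraction of missing data in one alignment (OTU name -> aligned sequence)."""
    length = len(next(iter(alignment.values())))
    # Gap characters within the sequences that are present.
    gaps = sum(1 for seq in alignment.values() for char in seq if char in GAP_CHARACTERS)
    # Each OTU absent from the alignment counts as a fully missing row.
    absent = len(all_otus - set(alignment))
    return (gaps + length * absent) / (length * len(all_otus))


def percent_missing(alignments: List[Dict[str, str]], all_otus: Set[str]) -> float:
    """Average the per-alignment fractions and express the result as a percentage."""
    fractions = [missing_fraction(alignment, all_otus) for alignment in alignments]
    return round(100 * sum(fractions) / len(fractions), 1)
```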
Table. Example of a supermatrix statistics file.
Statistics for each individual input alignment are stored in input_alignments_stats.csv, using a comma (',') as a field separator.
Table. Example of an input alignment statistics file.
Statistics for each individual output alignment are stored in output_alignments_stats.csv, using a comma (',') as a field separator. The following is an example of what such a file might look like.
Table. Example of an output alignment statistics file.
Statistics for individual OTUs are stored in otu_stats.csv, which contains the following fields:
- otu – The name of the OTU under study.
- paralogyFrequency – The paralogy frequency value for a specific OTU.
- timesAboveDivergenceThreshold – The number of times that a certain OTU was removed from individual alignments because its ratio of pairwise distances was above the divergence threshold.
The log file stores the time and date of the analysis, the input data, the settings, the supermatrix statistics in a readable format, and the time that the analysis took.
PhyloPyPruner version 0.3.0
Tuesday, 23. October 2018 09:42AM
---------------------------------
Input data:
  Directory: /Users/feli/Phylogenomics/trees+alignments/Kocot_et_al_2017_Syst_Biol_Lophotrochozoa/alignments_and_FastTree_trees_pre-PhyloTreePruner
Parameters:
  Minimum number of OTUs: 40
  Minimum sequence length: 50
  Long branch threshold: 4.0
  Minimum support value: 0.8
  Include: None
  Exclude: None
  Monophyly masking method: longest
  Rooting method: midpoint
  Outgroup rooting: ['DMEL']
  Paralogy pruning method: LS
  Paralogy frequency threshold: 4.0
  Trim divergent percentage: 0.25
  Jackknife: False
Input Alignments
----------------
  # of alignments: 1034
  # of sequences: 83508
  # of OTUs: 74
  avg # of sequences per alignment: 80
  avg # of OTUs: 58
  avg sequence length (ungapped): 177
  shortest sequence (ungapped): 22
  longest sequence (ungapped): 415
  % missing data: 31.5
  concatenated alignment length: 202420
Output Alignments
-----------------
  # of alignments: 1016
  # of sequences: 55626
  # of OTUs: 73
  avg # of sequences per alignment: 54
  avg # of OTUs: 54
  avg sequence length (ungapped): 183
  shortest sequence (ungapped): 50
  longest sequence (ungapped): 415
  % missing data: 32.8
  concatenated alignment length: 200398
-----------------------
Run time: 81.74 seconds
Paralogy frequency (PF) is the number of paralogs for an OTU divided by the number of alignments that the OTU is present in. This data is saved to a CSV file called otu_stats.csv and, if Matplotlib is installed, a PF plot is saved to paralogy_freq_plot.png, similar to the plot in Figure 3.
Figure 3. Paralogy Frequency Plot.
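The PF definition can be made concrete with a small sketch. It rests on a simplifying assumption that is not taken from the source: every sequence beyond the first from the same OTU in an alignment is counted as a paralog. The function name and the input layout (one list of OTU labels per alignment, one label per sequence) are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def paralogy_frequency(alignments: List[List[str]]) -> Dict[str, float]:
    """Paralogs per OTU divided by the number of alignments containing that OTU."""
    paralogs = Counter()
    presence = Counter()
    for otu_labels in alignments:
        for otu, copies in Counter(otu_labels).items():
            presence[otu] += 1
            # Assumption: copies beyond the first count as paralogs.
            paralogs[otu] += copies - 1
    return {otu: paralogs[otu] / presence[otu] for otu in presence}
```

Under this assumption, an OTU that always appears exactly once has a PF of 0, and higher values indicate OTUs that tend to contribute several sequences per alignment.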
A heatmap of the amount of bases present per OTU and gene partition is stored in the file occupancy_matrix.png. The user can choose to remove partitions and/or OTUs with less than a certain amount of occupancy through flags such as --min-otu-occupancy. Gene partitions and OTUs that were removed by these filters are shown in red within this heatmap.
Figure 4. An occupancy matrix.
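The values behind such a heatmap can be computed from a supermatrix and its gene partitions, as in the sketch below. The function names are hypothetical, occupancy is assumed to be the fraction of non-missing characters in each OTU and partition cell, and an OTU's overall occupancy is assumed to be the mean over its partitions; how the program applies its thresholds may differ.

```python
from typing import Dict, List, Tuple

Partition = Tuple[str, int, int]  # (gene name, start, end), 1-based inclusive


def occupancy_matrix(
    supermatrix: Dict[str, str],
    partitions: List[Partition],
    missing: str = "-?",
) -> Dict[str, Dict[str, float]]:
    """Fraction of non-missing characters per OTU and gene partition.

    Add 'N' or 'X' to `missing` depending on the data type.
    """
    matrix = {}
    for otu, sequence in supermatrix.items():
        row = {}
        for gene, start, end in partitions:
            block = sequence[start - 1:end]
            present = sum(1 for char in block if char not in missing)
            row[gene] = present / len(block)
        matrix[otu] = row
    return matrix


def otus_below(matrix: Dict[str, Dict[str, float]], min_occupancy: float) -> List[str]:
    """OTUs whose mean occupancy across partitions is below the threshold."""
    low = []
    for otu, row in matrix.items():
        if sum(row.values()) / len(row) < min_occupancy:
            low.append(otu)
    return sorted(low)
```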