Table of contents
- Output directory and overview
- Output alignments
- Supermatrix
- Supermatrix statistics
- Input alignment statistics
- Output alignment statistics
- OTU statistics
- Log file
- Paralogy Frequency statistics
- Occupancy
Output directory and overview
If no output directory has been set using the --output flag, then the folder that contains the input alignments is used as the output directory. If a non-existent directory has been provided, then that directory and all of its parent directories are automatically created. All CSV files use a comma (',') as a field separator. After running this program, the following files are generated:
<output directory>/
└── phylopypruner_output/
    ├── supermatrix_stats.csv
    ├── input_alignments_stats.csv
    ├── supermatrix.fas
    ├── output_alignments_stats.csv
    ├── otu_stats.csv
    ├── phylopypruner.log
    ├── paralogy_freq_plot.png*
    ├── occupancy_matrix.png*
    └── output_alignments/
        ├── 1_pruned.fas
        ├── 2_pruned.fas
        ├── 3_pruned.fas
        ...
*: Only produced if Matplotlib is installed.
Output alignments
All output alignments are stored in a subfolder of the output folder with the path <timestamp>_orthologs. For each alignment where an ortholog was recovered, the corresponding output alignment retains the same name, but with _pruned appended to it. If more than one alignment was generated for a single input alignment, an integer index is also added to the name. Output alignments retain the filename extension of the input alignment, so a file named 16s.fas will have the corresponding output alignment 16s_pruned.fas.
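The naming rule above can be sketched in a few lines of Python. This is an illustration only: the function name `pruned_name` is hypothetical, and the exact form of the integer index (here `_<index>` after `_pruned`) is an assumption, since the text only states that an index is added.

```python
from pathlib import Path
from typing import Optional


def pruned_name(input_name: str, index: Optional[int] = None) -> str:
    """Return an output filename for the given input alignment filename."""
    path = Path(input_name)
    # Assumption: the index, when present, follows "_pruned" as "_<index>".
    suffix = "_pruned" if index is None else f"_pruned_{index}"
    # Keep the original filename extension, as described in the text.
    return f"{path.stem}{suffix}{path.suffix}"


print(pruned_name("16s.fas"))
```

The extension is taken from the input name, so the same helper works for .fa, .fas, or .fasta files.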
By default, sequence data is kept on a single line. For more readable output, you can wrap sequences at column n by typing --wrap n, where n is a positive integer. For example, wrap sequence data at column 80 by typing --wrap 80. Output alignments use the same name as the input alignments, but with the string _pruned appended to the end. Note that for some paralogy pruning algorithms, such as maximum inclusion (MI), multiple orthologs may be produced for a single input file; in those cases, an index is also added to the end of the name.
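To make the effect of wrapping concrete, the following minimal sketch breaks a sequence into lines of at most n characters. The function `wrap_sequence` is a hypothetical helper written for this example and is not taken from the PhyloPyPruner source.

```python
def wrap_sequence(sequence: str, width: int) -> str:
    """Split a sequence into lines that are at most `width` characters long."""
    if width < 1:
        raise ValueError("width must be a positive integer")
    # Slice the sequence into consecutive chunks and join them with newlines.
    chunks = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join(chunks)


print(wrap_sequence("ACGTACGTAC", 4))
```

With a width of 80, a 200-column sequence would be written as two full lines followed by one line of 40 characters.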
Supermatrix
In addition to individual alignments, a supermatrix is also created by concatenating all the individual alignments into a single file called supermatrix.fas (the filename extension depends on your input alignments, so it might instead be called supermatrix.fasta, for example). The ranges of the individual gene partitions are written to the file gene_partitions.txt. The output is similar to what you would get if you ran the individual output alignments through an alignment concatenator such as FASconCAT. Missing data is denoted by either 'N' or 'X', depending on the type of data (nucleotides or amino acids). The program automatically guesses whether the data consists of nucleotides or amino acids, based on unique characters or on the proportion of A, C, G, and T characters (if these characters make up more than 50% of all individual bases, the data is assumed to consist of nucleotides).
Figure 1. Example of an output supermatrix file, seen in the alignment viewer AliView.
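The 50% rule for guessing the data type can be illustrated as follows. This sketch covers only the A/C/G/T proportion test, not the check based on unique characters. The function name `guess_data_type` and the decision to ignore '-' and '?' before counting are assumptions made for this example.

```python
def guess_data_type(sequence: str) -> str:
    """Guess whether a sequence holds nucleotides or amino acids."""
    # Assumption: gap characters are ignored before counting.
    residues = [char for char in sequence.upper() if char not in "-?"]
    if not residues:
        raise ValueError("sequence contains no residues")
    acgt = sum(char in "ACGT" for char in residues)
    # More than 50% A, C, G, or T means the data is treated as nucleotides.
    if acgt / len(residues) > 0.5:
        return "nucleotide"
    return "amino acid"


print(guess_data_type("ACGTNNAC--"))
print(guess_data_type("MKVLAAGIVG"))
```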
AUTO, 11267 = 1-205
AUTO, 14983 = 206-379
AUTO, 05504 = 380-516
AUTO, 05749 = 517-635
AUTO, 01685 = 636-860
AUTO, 10744 = 861-1030
AUTO, 01482 = 1031-1262
AUTO, 01549 = 1263-1474
AUTO, 05894 = 1475-1703
AUTO, 01770 = 1704-1909
Figure 2. Example of a gene partition-file generated by PhyloPyPruner.
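A partition file in the layout shown above can be read back with a short parser. The sketch below assumes that every line follows the pattern `<model>, <gene> = <start>-<end>`, as in the example; the name `parse_partitions` is hypothetical.

```python
import re
from typing import List, Tuple

# One partition per line, for example: "AUTO, 11267 = 1-205"
PARTITION_RE = re.compile(r"^\s*(\w+)\s*,\s*(\S+)\s*=\s*(\d+)\s*-\s*(\d+)\s*$")


def parse_partitions(text: str) -> List[Tuple[str, int, int]]:
    """Return (gene, start, end) tuples with 1-based, inclusive coordinates."""
    partitions = []
    for line in text.splitlines():
        match = PARTITION_RE.match(line)
        if match is None:
            continue  # skip blank or unrecognised lines
        _model, gene, start, end = match.groups()
        partitions.append((gene, int(start), int(end)))
    return partitions


example = "AUTO, 11267 = 1-205\nAUTO, 14983 = 206-379\n"
for gene, start, end in parse_partitions(example):
    print(gene, end - start + 1)  # gene name and partition length
```

Gene names are kept as strings so that leading zeros, as in 05504, are preserved.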
Supermatrix statistics
This file contains statistics for the input and output alignments, each treated as a single concatenated alignment. Supermatrix statistics are stored in the supermatrix_stats.csv file, which uses a comma (',') as a field separator. If jackknifing was performed, the results are included here, but none of the resulting alignments are saved.
Missing data is calculated by counting the number of gap characters (gap characters recognized by PhyloPyPruner are '-', '?' or 'x') in each sequence and adding the alignment length multiplied by the number of OTUs that are missing from the alignment. This is done per alignment, and the missing-data values for all alignments are then summed and divided by the total number of alignments. For display purposes, this number is rounded and multiplied by 100.
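The calculation described above can be sketched as follows. Two points are assumptions that the text does not state: each alignment's missing data is normalised by the alignment length multiplied by the total number of OTUs, and each alignment holds at most one sequence per OTU. The function names are hypothetical, and the sketch multiplies by 100 before rounding to one decimal.

```python
from typing import Dict, List, Set

# Gap characters listed in the text; matched case-sensitively in this sketch.
GAP_CHARACTERS = {"-", "?", "x"}


def missing_fraction(alignment: Dict[str, str], all_otus: Set[str]) -> float:
    """Fraction of missing cells in one alignment, keyed by OTU name."""
    length = len(next(iter(alignment.values())))
    # Gap characters inside the sequences that are present.
    gaps = sum(seq.count(char) for seq in alignment.values() for char in GAP_CHARACTERS)
    # Every absent OTU contributes a full row of missing data.
    absent = len(all_otus - set(alignment))
    # Assumption: normalise by the full OTU-by-column grid.
    return (gaps + length * absent) / (length * len(all_otus))


def pct_missing_data(alignments: List[Dict[str, str]], all_otus: Set[str]) -> float:
    """Mean missing-data fraction over all alignments, as a percentage."""
    fractions = [missing_fraction(alignment, all_otus) for alignment in alignments]
    return round(sum(fractions) / len(fractions) * 100, 1)


otus = {"A", "B", "C"}
first = {"A": "AC-T", "B": "ACGT"}            # OTU "C" is absent
second = {"A": "AC", "B": "AC", "C": "??"}
print(pct_missing_data([first, second], otus))
```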
Table. Example of a supermatrix statistics file.
id | alignments | sequences | otus | meanSequences | meanOtus | meanSeqLen | shortestSeq | longestSeq | pctMissingData | catAlignmentLen |
---|---|---|---|---|---|---|---|---|---|---|
input | 1034 | 83508 | 74 | 80 | 58 | 177 | 22 | 415 | 31.5 | 202420 |
output_BGRA_excluded | 1016 | 55626 | 73 | 54 | 54 | 183 | 50 | 415 | 32.8 | 200398 |
Input alignment statistics
Statistics for each individual input alignment are stored in input_alignments_stats.csv, using a comma (',') as a field separator.
Table. Example of an input alignment statistics file.
filename | otus | sequences | meanSeqLen | shortestSeq | longestSeq | pctMissingData | alignmentLen | shortSequencesRemoved | longBranchesRemoved | monophyliesMasked | nodesCollapsed | divergentOtusRemoved |
---|---|---|---|---|---|---|---|---|---|---|---|---|
08799.fa | 63 | 86 | 163 | 59 | 184 | 0.112424165824 | 184 | 0 | 0 | 23 | 28 | 0 |
02792.fa | 67 | 82 | 278 | 120 | 321 | 0.131790897348 | 321 | 0 | 0 | 15 | 23 | 0 |
05462.fa | 52 | 63 | 232 | 128 | 260 | 0.106837606838 | 260 | 0 | 0 | 12 | 25 | 1 |
01029.fa | 57 | 68 | 211 | 106 | 233 | 0.0903181014895 | 233 | 0 | 0 | 11 | 21 | 0 |
05466.fa | 62 | 92 | 128 | 69 | 139 | 0.0734282139506 | 139 | 0 | 0 | 28 | 29 | 2 |
Output alignment statistics
Statistics for each individual output alignment are stored in output_alignments_stats.csv, using a comma (',') as a field separator. The following is an example of what such a file might look like.
Table. Example of an output alignment statistics file.
filename | otus | sequences | meanSeqLen | shortestSeq | longestSeq | pctMissingData | alignmentLen |
---|---|---|---|---|---|---|---|
08799_pruned.fa | 63 | 63 | 167 | 70 | 184 | 0.0902346445825 | 184 |
02792_pruned.fa | 66 | 66 | 283 | 120 | 321 | 0.11611441518 | 321 |
05462_pruned.fa | 48 | 48 | 233 | 129 | 260 | 0.103605769231 | 260 |
01029_pruned.fa | 56 | 56 | 218 | 106 | 233 | 0.064071122011 | 233 |
05466_pruned.fa | 60 | 60 | 133 | 69 | 139 | 0.0398081534772 | 139 |
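Because the statistics files are plain comma-separated text, they can be read with Python's standard csv module. The sketch below parses two rows copied from the example table above from an in-memory string; for a real run you would pass an open file handle instead. It assumes that the CSV header uses the same column names as the table, and the helper name `load_stats` is hypothetical.

```python
import csv
import io
from typing import Dict, List, TextIO


def load_stats(handle: TextIO) -> List[Dict[str, str]]:
    """Read a comma-separated statistics file into a list of row dictionaries."""
    return list(csv.DictReader(handle))


# Two rows taken from the example table, written as they would appear in a CSV file.
example = io.StringIO(
    "filename,otus,sequences,meanSeqLen,shortestSeq,longestSeq,pctMissingData,alignmentLen\n"
    "08799_pruned.fa,63,63,167,70,184,0.0902346445825,184\n"
    "02792_pruned.fa,66,66,283,120,321,0.11611441518,321\n"
)
rows = load_stats(example)

# All values are read as strings, so convert before comparing numerically.
longest = max(rows, key=lambda row: int(row["alignmentLen"]))
print(longest["filename"], longest["alignmentLen"])
```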
OTU statistics
Statistics for individual OTUs are stored in otu_stats.csv.
- otu – The name of the OTU under study.
- paralogyFrequency – The paralogy frequency value for a specific OTU.
- timesAboveDivergenceThreshold – The number of times that an OTU was removed from individual alignments because its ratio of pairwise distances was above the divergence threshold.
otu | paralogyFrequency | timesAboveDivergenceThreshold |
---|---|---|
PCAL | 30.4 | 13 |
SAGI | 0 | 4 |
PCER | 10.3 | 33 |
PARC | 39.2 | 0 |
BNER | 25.3 | 1 |
Log file
The log file stores the date and time at which the analysis was run, the input data, the settings, the supermatrix statistics in a readable format, and the time that the analysis took.
PhyloPyPruner version 0.3.0
Tuesday, 23. October 2018 09:42AM
---------------------------------
Input data:
Directory: /Users/feli/Phylogenomics/trees+alignments/Kocot_et_al_2017_Syst_Biol_Lophotrochozoa/alignments_and_FastTree_trees_pre-PhyloTreePruner
Parameters:
Minimum number of OTUs: 40
Minimum sequence length: 50
Long branch threshold: 4.0
Minimum support value: 0.8
Include: None
Exclude: None
Monophyly masking method: longest
Rooting method: midpoint
Outgroup rooting: ['DMEL']
Paralogy pruning method: LS
Paralogy frequency threshold: 4.0
Trim divergent percentage: 0.25
Jackknife: False
Input Alignments
----------------
# of alignments: 1034
# of sequences: 83508
# of OTUs: 74
avg # of sequences per alignment: 80
avg # of OTUs: 58
avg sequence length (ungapped): 177
shortest sequence (ungapped): 22
longest sequence (ungapped): 415
% missing data: 31.5
concatenated alignment length: 202420
Output Alignments
-----------------
# of alignments: 1016
# of sequences: 55626
# of OTUs: 73
avg # of sequences per alignment: 54
avg # of OTUs: 54
avg sequence length (ungapped): 183
shortest sequence (ungapped): 50
longest sequence (ungapped): 415
% missing data: 32.8
concatenated alignment length: 200398
-----------------------
Run time: 81.74 seconds
Paralogy Frequency statistics
Paralogy frequency (PF) is the number of paralogs for an OTU divided by the number of alignments that the OTU is present in. This data is saved to a CSV file called otu_stats.csv and, if Matplotlib is installed, a PF plot is saved to paralogy_freq_plot.png, similar to the plot in Figure 3.
Figure 3. Paralogy Frequency Plot.
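The PF ratio defined above is simple enough to show as a short sketch. How paralogs are counted per OTU is not specified in this section, so the counts are taken as given inputs here; the function name `paralogy_frequency` and both dictionaries are hypothetical.

```python
from typing import Dict


def paralogy_frequency(
    paralog_counts: Dict[str, int], alignments_present: Dict[str, int]
) -> Dict[str, float]:
    """Divide each OTU's paralog count by the number of alignments it occurs in."""
    frequencies = {}
    for otu, present in alignments_present.items():
        if present == 0:
            continue  # PF is undefined for an OTU that occurs in no alignment
        frequencies[otu] = paralog_counts.get(otu, 0) / present
    return frequencies


# Made-up counts for illustration only.
print(paralogy_frequency({"PCAL": 3}, {"PCAL": 10, "SAGI": 4}))
```

An OTU without any recorded paralogs receives a PF of zero, as long as it is present in at least one alignment.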
Occupancy
A heatmap of the amount of bases present per OTU and gene partition is stored in the file occupancy_matrix.png. The user can choose to remove partitions and/or OTUs with less than a certain amount of occupancy through the flags --min-gene-occupancy and --min-otu-occupancy. Gene partitions and OTUs that were removed by these filters are shown in red within this heatmap.
Figure 4. An occupancy matrix.
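The numbers behind such a heatmap can be approximated from a supermatrix and its gene partitions. The sketch below is an assumption about how occupancy might be computed, not the program's own code: occupancy is taken as the fraction of positions in a partition that are not missing, and the default set of missing characters ("-", "?", "X", "x") is also an assumption. For nucleotide data you may want to pass a set that includes 'N'. The function name `occupancy_matrix` is hypothetical.

```python
from typing import Dict, List, Tuple


def occupancy_matrix(
    supermatrix: Dict[str, str],
    partitions: List[Tuple[str, int, int]],
    missing_chars: str = "-?Xx",
) -> Dict[str, Dict[str, float]]:
    """Return, per OTU and gene, the fraction of positions that hold data."""
    matrix: Dict[str, Dict[str, float]] = {}
    for otu, sequence in supermatrix.items():
        row = {}
        for gene, start, end in partitions:
            # Partition coordinates are 1-based and inclusive.
            segment = sequence[start - 1:end]
            if not segment:
                row[gene] = 0.0
                continue
            present = sum(char not in missing_chars for char in segment)
            row[gene] = present / len(segment)
        matrix[otu] = row
    return matrix


toy = {"A": "ACGT--XX", "B": "AC--ACGT"}
print(occupancy_matrix(toy, [("g1", 1, 4), ("g2", 5, 8)]))
```

Row or column averages of this matrix could then be compared against a chosen minimum occupancy to decide which OTUs or partitions to drop.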