inconsistent results (for polyploid organisms)
Dear BUSCO team,
First of all I like to thank you for providing BUSCO which I use a lot. However, today I like to report something a noticed recently.
I'm working with polyploid plants, i.e., plant that carry several copies of a subgenome. These subgenomes are slightly different and can be distinguished. In my current case, I 'm working with wheat which has subgenome A, B and D. Running BUSCO on polyploid plants overestimates the number of duplicated genes. Hence, I thought I should run BUSCO one the individual subgenomes. To my surprise the results differ at an unexpected point.
version: BUSCO version 5.2.2
current behaviour: I have created a minimal example (cf. attached file) containing only two proteins one from subgenome A and one from subgenome B. Running BUSCO on a fasta file containing both proteins or containing only the protein from A or B subgenome, I always receive one complete and single-copy BUSCO.
expected behavior: I would expect consistency, i.e., either that the combined approach returns a duplicated BUSCO or that the protein on the A subgenome would not be classified as complete and single-copy BUSCO.
command line:
busco -c 20 -m proteins -i toy_example.fasta -o toy-combined -l poales_odb10
head -n2 toy_example.fasta > toy-example-A.fasta
busco -c 20 -m proteins -i toy-example-A.fasta -o toy-A -l poales_odb10
tail -n2 toy_example.fasta > toy-example-B.fasta
busco -c 20 -m proteins -i toy-example-B.fasta -o toy-B -l poales_odb10
grep -P "^54at38820" toy-*/run_poales_odb10/full_table.tsv
output:
[...]
toy-A/run_poales_odb10/full_table.tsv:54at38820 Complete TraesJUL5A01G205200.1 2983.8 1836 https://www.orthodb.org/v10?query=54at38820 Histidine kinase- DNA gyrase B- and HSP90-like ATPase family protein
toy-B/run_poales_odb10/full_table.tsv:54at38820 Complete TraesJUL5B01G498000.1 4073.3 2154 https://www.orthodb.org/v10?query=54at38820 Histidine kinase- DNA gyrase B- and HSP90-like ATPase family protein
toy-combined/run_poales_odb10/full_table.tsv:54at38820 Complete TraesJUL5B01G498000.1 4073.3 2154 https://www.orthodb.org/v10?query=54at38820 Histidine kinase- DNA gyrase B- and HSP90-like ATPase family protein
I checked the parameters but did not find any obvious parameter to be set. Checking your manual, I found the following points:
- „The ‘expected score’ cut-off is defined as 90% of the minimum bitscore from an HMM search of all of a BUSCO group’s members against its own HMM profile (i.e. the lowest scoring match of the sequences used to build the profile).“
- „To be classified as a true ortholog, any BUSCO-matching gene from the species being assessed (from its genome, transcriptome, or gene set) must score above the ‘expected score’ cut-off. For a match to be classified as ‘complete’, it must satisfy the ‘expected-length’ cut-off, which is defined using each BUSCO group’s protein length distribution (Figure S1). Any BUSCO-matching gene from the species being assessed whose protein length falls within two standard deviations (2σ) of the BUSCO group’s mean length is classified as ‘complete’.“
https://oup.silverchair-cdn.com/oup/backfile/Content_public/Journal/bioinformatics/31/19/10.1093_bioinformatics_btv351/2/btv351_Supplementary_Data.zip?Expires=1639643315&Signature=5EkUaYEt1yywIXyZjGzHJfb95gz0oqifJtFG92nDwE5o0UKKPHlDcpKttbkIQTwALda-WlMgFllxlD7WLjEZqXEdvyj1nWj58bDymBg4LVqdCWCi7CTR-HRNv-ZF94rlJY25yimvyI4o0SUULTF3S-dzbzHepfasUK-dZRCGTXBOxAv8f1G-CyHQdoDOGyo8ElefTl1kwGQ6yToC6H2aSYvDqOTzWVkOtJTIEgndAQ40wvClgDZZ0Xr38uNIBWsuoeCHgRbuQbpbVZVYgXO-neJbm48mo9YHVG0dhjan1cCChgyYu9NRT7uT7zl9I493me69sAKA8USSBg64xQNQ__&Key-Pair-Id=APKAIE5G5CRDK6RD3PGA
Hence, I assumed the threshold should be independent of the input (-i). Maybe some heuristic steps that are used in HMMER (cf. HMMER parameter --max, http://eddylab.org/software/hmmer/Userguide.pdf) could lead to the current behaviour, but I did not see a BUSCO parameter to change and test this.
The current behaviour could also occur in diploid organisms and not only in polyploid organisms. So it might be a general problem.
Looking forward to your reply.