Hi Matthew, many thanks for the swift response. I have uploaded one of the input files in the link you have provided me. I submitted it under the name "Carabus_problematicus_genomic.fna". I downloaded the genome from https://www.ncbi.nlm.nih.gov/datasets/genome/GCA_963422195.1/
Hi, I would like to try to replicate this. Can you please share one of the input files that causes this error? This link is to a secure folder, only accessible by the BUSCO developers: https://drive.switch.ch/index.php/s/racy9WzmFuy1k2C
I submitted an array job to my cluster to run BUSCO on 239 genomes. However, seven of them failed. In the log file, it's written:
.... (log file truncated)....
2024-03-27 11:51:53 ERROR:busco.BuscoRunner list index out of range
2024-03-27 11:51:53 DEBUG:busco.BuscoRunner list index out of range
Traceback (most recent call last):
File "/usr/local/lib/python3.7/site-packages/busco/BuscoRunner.py", line 165, in run
self.runner.run_analysis()
File "/usr/local/lib/python3.7/site-packages/busco/BuscoRunner.py", line 564, in run_analysis
self.analysis.run_analysis()
File "/usr/local/lib/python3.7/site-packages/busco/analysis/GenomeAnalysis.py", line 866, in run_analysis
busco_ids=incomplete_buscos,
File "/usr/local/lib/python3.7/site-packages/busco/BuscoLogger.py", line 62, in wrapped_func
self.retval = func(*args, **kwargs)
File "/usr/local/lib/python3.7/site-packages/busco/analysis/BuscoAnalysis.py", line 136, in run_hmmer
self.validate_output()
File "/usr/local/lib/python3.7/site-packages/busco/analysis/GenomeAnalysis.py", line 195, in validate_output
overlaps, exon_records, hmmer_results
File "/usr/local/lib/python3.7/site-packages/busco/analysis/GenomeAnalysis.py", line 427, in handle_overlaps
overlap_entry, exon_records, hmmer_results
File "/usr/local/lib/python3.7/site-packages/busco/analysis/GenomeAnalysis.py", line 467, in handle_diff_busco_overlap
if hmmer_match_details1[0]["score"] > hmmer_match_details2[0]["score"]:
IndexError: list index out of range
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.7/site-packages/busco/run_BUSCO.py", line 74, in run
runner.run()
File "/usr/local/lib/python3.7/site-packages/busco/BuscoLogger.py", line 62, in wrapped_func
self.retval = func(*args, **kwargs)
File "/usr/local/lib/python3.7/site-packages/busco/BuscoRunner.py", line 197, in run
raise BatchFatalError(str(exc_value))
busco.Exceptions.BatchFatalError: list index out of range
2024-03-27 11:51:53 ERROR:busco.BuscoRunner BUSCO analysis failed!
2024-03-27 11:51:53 ERROR:busco.BuscoRunner Check the logs, read the user guide (https://busco.ezlab.org/busco_userguide.html), and check the BUSCO issue board on https://gitlab.com/ezlab/busco/issues
2024-03-27 11:51:54 DEBUG:urllib3.connectionpool Starting new HTTPS connection (1): busco-data.ezlab.org:443
2024-03-27 11:51:54 DEBUG:urllib3.connectionpool Starting new HTTPS connection (1): busco-data.ezlab.org:443
2024-03-27 11:51:54 DEBUG:urllib3.connectionpool https://busco-data.ezlab.org:443 "PUT /upload/rundata202403271151543254974.txt HTTP/1.1" 201 0
2024-03-27 11:51:54 DEBUG:urllib3.connectionpool https://busco-data.ezlab.org:443 "PUT /upload/rundata202403271151543254974.txt HTTP/1.1" 201 0
2024-03-27 11:51:54 DEBUG:busco.run_BUSCO File uploaded successfully.
It seems like the BUSCO run was successful but it didn't produce the busco sequences. I ran this job using parameters as follow, with BUSCO v 5.7.0:
busco -i dataset/${REFGEN}/${REFGEN}_genomic.fna -m genome -l endopterygota_odb10 -c 20 -f -o dataset/${REFGEN}/${REFGEN} --download_path busco_downloads --metaeuk &> busco_${REFGEN}.log
Could you help me figure out the issue and how to solve it? Many thanks!
Hi Matthew,
Thanks for your response. I just updated and it's running, once it's finished or crashed I'll be back with updates.
2024-03-26 11:27:57 ERROR:
Warning message:
The size
argument of element_line()
is deprecated as of ggplot2 3.4.0.
i Please use the linewidth
argument instead.
Error: C stack usage 7971284 is too close to the limit
Execution halted
Hi, I uploaded 10367_contigs.fasta.gz (187MB) to your folder.
I run it (after gunzipping) with:
docker run -u $(id -u) -v $(pwd):/busco_wd ezlabgva/busco:v5.7.0_cv1 busco -m geno -c 24 -l lepidoptera_odb10 -i 10367_contigs.fasta
(on a 12-core AMD Ryzen, so 24 hyperthreads)
It runs for about 40 seconds, then it is done with miniprot_align and it enters into this 30-40 minute long single-thread phase..
End result should be something around 2300 complete BUSCOs out of 5286 possible in the dataset.
Hi Bjorn, my tests so far have not indicated any improvement when I parallelise this post-processing step. It's a little unexpected, and it may be something to do with the files I'm using to test. If you are happy to share your input file to a secure folder I can do some tests with it too. Here's a link to a secure folder: https://drive.switch.ch/index.php/s/MRUkHW4Oz6u8i8A
The missing GFF files in the Metaeuk pipeline is a bug - thanks for pointing it out. We'll fix that shortly.
A slight correction to my last comment - different CDS regions may be on different reading frames, but a single CDS region should not switch reading frames, which is what we sometimes observe with Miniprot. So if you attempt to translate the CDS regions you may not get the protein sequence reported by Miniprot.
The Miniprot approach is useful for ascertaining if a gene is present, but as Miniprot will find a gene even if there are frameshift errors in the input we cannot automatically trust the original sequence for downstream analysis. The GFF info will indicate if there is a frameshift in a given CDS region, however, so if a prediction does not report a frameshift or other anomaly you can proceed with extracting the CDS yourself.
I see.. so you explicitly avoid outputting the sequences as it would be confusing that there could be frameshifts inside them, disturbing a simple translation? I should go have a more thorough look into Miniprot I guess!
But actually I found now that with 5.7.0 and --metaeuk you do get the .fna files, but now you don't get the .gff files anymore in the output, so that is another change. Do you know if it's possible to get them back?
If you extract the full CDS region and translate it you will find that in many cases the Miniprot protein sequence will switch between different reading frames. The Miniprot developers advertise the fact that it is flexible with frameshifts, and it generates a lot of matches as a result.
However CDS regions in GFF files should all be in the same reading frame. As they are often not in Miniprot GFF files we decided not to create the nucleotide sequences.
I think a typical server that can handle a 15-20 GB process comfortably (maybe it has 32 or 64GB of RAM), also has 8-16 cores of CPU, and it is in that setting that the un-multi-threadedness is biting you as you can't put 8 or 16 parallel BUSCO runs on that server. You would have to run them one by one, wasting 15/16 of your CPU performance. You can't get servers with 256 GB RAM and 1 CPU core which would be optimal for throughput of a single-threaded high-RAM process
Hm I see... yes I read this in the 5.7.0 manual before, but as I haven't studied how metaeuk or augustus work, I couldn't decipher what the difference is between using a gene predictor or a gene mapper and what it meant with the stop codon. In the end, don't we just get a credence that a section of my input sequence is one of the busco genes or not, regardless of the technique?
Or are you saying that the specification of the exons can be unreliable, and point to sections that contain non-gene data and therefore you might get those stop codons inside the gene?
We want to do phylogenomics on nucleotide sequences, not AA sequences, and preferably with introns excluded (as these can diverge between species at a higher rate), so a decently trustworthy spec of the actual protein-coding sections per gene would be optimal, but I might have to play with metaeuk as well then if I read you right or at the very least do some post-processing on the whole mRNA nucleotide span specified in the output gff from miniprot?
Sorry if the last bit isn't strictly an issue with BUSCO, it's more of a pipeline question I guess, still could be useful to add to the doc in that case for use-case suggestions or something..
Hi Bjorn, the .fna
files were deliberately removed from the v5.7.0 release. After internal discussions we had concerns that providing nucleotide files might lead users to trust the sequences for downstream analysis, as you suggested. However, as Miniprot is a gene mapper, not a gene predictor, caution should be exercised with these sequences, which in many cases contain internal stop codons.
This is mentioned in the user guide: https://busco.ezlab.org/busco_userguide.html#eukaryota
If it is important for your analysis to have these sequences I recommend running the Metaeuk pipeline (--metaeuk
) or the Augustus (--augustus
) pipeline for more accurate gene predictions.
The 5.6.1 release gave me the nucleotide sequences (.fna) for the exons from the genes in the output folders, along with the .faa and .gff feature files. But the 5.7.0, run with the same options, only outputs .faa and .gff and no .fna.
I run it like this:
docker run -u $(id -u) -v $(pwd):/busco_wd ezlabgva/busco:v5.7.0_cv1 busco -m geno -c 24 -l lepidoptera_odb10 -i contigs.fasta
And in the busco_sequences/single_copy_busco_sequences I now don't get the .fna files as previously.
It does seem that the .gff feature file still points to the nucleotide sequence from the fasta input, so I can write a script that parses out the sequences like before, but would be good to clarify if this is the suggested workflow to detect the genes in my input (I'm doing phylogenomics after this so intend to pass the detected sequences into iq-tree eventually), like, the busco_sequences output is considered to be tool-dependent (miniprot vs metaeuk) and not a well-defined output?
For example a .gff file could have the following output, and in that case I can go into my contig NODE_2389 and extract each CDS region and concatenate the sequences to get what was in the .fna file previously? (For the 5.6.1 release I did the same checks, but there the .gff file had "exon" intervals, that if concatenated from the input source, gave the same data as in the .fna file)
NODE_2389_length_8483_cov_3.384368 miniprot mRNA 529 7905 2775 - . ID=MP035653;Rank=1;Identity=0.6859;Positive=0.8113;Target=550at7088_2 657 1353
NODE_2389_length_8483_cov_3.384368 miniprot CDS 7710 7905 352 - 0 Parent=MP035653;Rank=1;Identity=0.8788;Target=550at7088_2 657 721
NODE_2389_length_8483_cov_3.384368 miniprot CDS 6963 7131 271 - 2 Parent=MP035653;Rank=1;Identity=0.8036;Target=550at7088_2 722 777
NODE_2389_length_8483_cov_3.384368 miniprot CDS 6430 6577 223 - 1 Parent=MP035653;Rank=1;Identity=0.6939;Target=550at7088_2 778 827
NODE_2389_length_8483_cov_3.384368 miniprot CDS 6030 6236 304 - 0 Parent=MP035653;Rank=1;Identity=0.6957;Target=550at7088_2 828 896
NODE_2389_length_8483_cov_3.384368 miniprot CDS 5448 5634 276 - 0 Parent=MP035653;Rank=1;Identity=0.7143;Target=550at7088_2 897 958
NODE_2389_length_8483_cov_3.384368 miniprot CDS 4893 5069 258 - 2 Parent=MP035653;Rank=1;Identity=0.7119;Target=550at7088_2 959 1017
NODE_2389_length_8483_cov_3.384368 miniprot CDS 3023 3151 202 - 2 Parent=MP035653;Rank=1;Identity=0.8605;Target=550at7088_2 1018 1060
NODE_2389_length_8483_cov_3.384368 miniprot CDS 2655 2795 222 - 2 Parent=MP035653;Rank=1;Identity=0.8085;Target=550at7088_2 1061 1107
NODE_2389_length_8483_cov_3.384368 miniprot CDS 2054 2329 164 - 2 Parent=MP035653;Rank=1;Identity=0.4362;Target=550at7088_2 1108 1192
NODE_2389_length_8483_cov_3.384368 miniprot CDS 1558 1701 145 - 2 Parent=MP035653;Rank=1;Identity=0.5833;Target=550at7088_2 1193 1236
NODE_2389_length_8483_cov_3.384368 miniprot CDS 958 1172 199 - 2 Parent=MP035653;Rank=1;Identity=0.5493;Target=550at7088_2 1237 1308
NODE_2389_length_8483_cov_3.384368 miniprot CDS 529 663 159 - 0 Parent=MP035653;Rank=1;Identity=0.7111;Target=550at7088_2 1309 1353
One of the aspects we have not really highlighted about the Miniprot pipeline is that it takes considerably more memory than Metaeuk. This goes some way towards the lower run time, but if memory becomes an issue it may be best to revert to the Metaeuk pipeline with --metaeuk
.
Cool! Yeah also since a single run consumes maybe 15-20 GB of RAM, running a lot of BUSCO-runs in parallel for different genomes is sadly out of the question so that "trivial parallelization" that I mentioned above won't work. So you really want to get it done using as many threads as possible.
Hello,
I'm assessing genome completeness from a de novo genome assembly. When I run BUSCO --miniprot and compleasm there is a slight different between the results. Perhaps Do you know what's the reason for this?
dataset eurotiales_odb10
Busco result: C:95.8%[S:95.4%,D:0.4%],F:0.3%,M:3.9%,n:4191,E:35.9%
4012 Complete BUSCOs (C) (of which 1441 contain internal stop codons) | 3997 Complete and single-copy BUSCOs (S) | 15 Complete and duplicated BUSCOs (D) | 11 Fragmented BUSCOs (F) | 168 Missing BUSCOs (M) | 4191 Total BUSCO groups searched |
command used: busco -f --miniprot -c 32 -l eurotiales_odb10 -m genome -i assembly.fasta -o out_busco_miniprot
Compleasm result:
S:97.18%, 4073 D:0.12%, 5 F:0.12%, 5 I:0.00%, 0 M:2.58%, 108 N:4191
command used: compleasm run -t 32 -l eurotiales_odb10 -a assembly.fasta -o out_compleasm
Thank you very much
Pretty clear. Thank you very much.
Possibly, though I removed some multi-threaded post-processing from other areas of the code because the overhead of setting up processes was actually more costly than running on 1 cpu. This case might be different - I'll take a closer look and see what we can do.