Erroneous HMMER file counting
First of all, just let me say thank you to the whole team for making available such awesome and useful software!
A colleague of mine was running BUSCO without a hitch, but I recently tried running it with 16 cores (he was only using 1) and ran into some issues. I get this error when running BUSCO:
INFO 03/22/2017 21:48:27 => 100% of predictions performed (364/329 candidate proteins)
Exception in thread hmmer-final.scaffolds_busco_output_mt-10:
Traceback (most recent call last):
  File "/usr/lib/python3.4/threading.py", line 920, in _bootstrap_inner
    self.run()
  File "/opt/busco/BUSCO.py", line 192, in run
    self.analysis._process_hmmer_tasks()
  File "/opt/busco/BUSCO.py", line 1876, in _process_hmmer_tasks
    if state > self.slate[-1]:
IndexError: list index out of range
I get about one of those errors per thread. Note that it reports 364/329 candidate proteins, i.e. more predictions performed than there are candidates. I looked into it a little, and it looks like the _process_hmmer_tasks function counts the number of output files generated by HMMER jobs to determine the proportion of jobs completed. I don't know enough about Python multithreading or your job queuing setup to say why, but it looks like in some cases the os.listdir call used to count the files lists the same file multiple times.
I added some code to print out the list of files and here's part of the output:
INFO FILES: ['EOG0937004M.out.1', 'EOG09370082.out.1', 'EOG093700N7.out.1', 'EOG0937017X.out.1', 'EOG0937017X.out.2', 'EOG0937018Z.out.1', 'EOG0937019B.out.1', 'EOG093701EE.out.1', 'EOG093701EE.out.1'
Note that the last file appears twice in the list. I'm not sure what the proper fix is, but replacing the original file-counting line with these lines:
files = [name for name in os.listdir('%shmmer_output' % self.mainout)
         if os.path.isfile(os.path.join('%shmmer_output' % self.mainout, name))]
files = list(set(files))
check = len(files)
to make a unique list of files gets rid of the crash and gives the same results as running with a single core. I'd be very interested to hear what your team's diagnosis of the issue is and look forward to using BUSCO some more!
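In case it helps with reproducing this outside of BUSCO, here is a minimal standalone sketch of the same idea. The function names count_unique_files and repeated_names are my own and are not part of BUSCO; the sketch only assumes that the directory listing may contain a name more than once, as in the output above.

```python
import os
from collections import Counter


def count_unique_files(directory):
    """Count distinct regular files in `directory`.

    Building a set means a name that the listing returns more than
    once is only counted once. (Hypothetical helper, not BUSCO code.)
    """
    unique = {
        name
        for name in os.listdir(directory)
        if os.path.isfile(os.path.join(directory, name))
    }
    return len(unique)


def repeated_names(names):
    """Return the sorted names that occur more than once in a listing."""
    return sorted(name for name, n in Counter(names).items() if n > 1)


# The tail of the listing printed above, with the repeated entry.
listing = ['EOG0937019B.out.1', 'EOG093701EE.out.1', 'EOG093701EE.out.1']
print(repeated_names(listing))  # ['EOG093701EE.out.1']
print(len(listing), len(set(listing)))  # 3 2
```

This only hides the symptom, so it does not explain why the listing contains duplicates in the first place.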
Thanks, Robert