lineage items
The new lineage integration is very nice. I have some suggestions based on a first-time user's issues:
(1) Misspelling the -l lineage name, e.g. -l avess, returns a confusing and overwhelming error message due to a KeyError on line 103 in BuscoDownloadManager.py: latest_update = type(self).version_files[data_basename][0]
Perhaps enhance the original lines 149-150 of the get(self, data_name, category) routine in BuscoDownloadManager.py:
data_basename = os.path.basename(data_name)
local_filepath = os.path.join(self.local_download_path, category, data_basename)
with something like:
data_basename = os.path.basename(data_name)
if data_basename not in type(self).version_files:  # check for valid lineage name
    raise SystemExit("{} is not a valid lineage name\n".format(data_basename))
local_filepath = os.path.join(self.local_download_path, category, data_basename)
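A minimal standalone sketch of the same check, extended with a spelling hint. The function name `check_lineage_name` and the sample dictionary are hypothetical and only for illustration; the one assumption taken from the code above is that valid names are the keys of `version_files`.

```python
import difflib
import os


def check_lineage_name(data_name, version_files):
    """Return the basename of data_name, or exit with a short message if it is unknown."""
    data_basename = os.path.basename(data_name)
    if data_basename not in version_files:
        # Offer up to three close spellings instead of a KeyError traceback.
        hints = difflib.get_close_matches(data_basename, list(version_files), n=3)
        message = "{} is not a valid lineage name".format(data_basename)
        if hints:
            message += " (did you mean: {}?)".format(", ".join(hints))
        raise SystemExit(message)
    return data_basename


# Hypothetical contents; the real dictionary is filled from the downloaded version file.
sample_version_files = {"aves_odb10": ("<date>",), "tetrapoda_odb10": ("<date>",)}
print(check_lineage_name("aves_odb10", sample_version_files))
```

Raising `SystemExit` with a one-line message keeps the failure readable for a first-time user, and the hint costs nothing when the name is valid.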
(2) I was hoping to download the lineages using the built-in procedure without actually performing the BUSCO assembly run.
I tried busco -l aves, and that gives lots of errors about -i being missing. Unfortunately, with the configparser
throwing a KeyError due to ;in= in the cfg file and so on, it is a lot to parse.
Anyway, I found this worked nicely: making an empty file, empty.fa, and executing the following:
busco -l aves -i empty.fa -o test -f
It would still be a nice feature, perhaps asking the user whether that was their intent when the only args are -l <lineage>.
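One way the "only -l was given" case could be detected, as a rough sketch. This is a hypothetical stand-alone parser, not BUSCO's actual argument handling; it only reuses the -l, -i and -o flags mentioned above and assumes -i can be made optional at parse time.

```python
import argparse


def parse_args(argv):
    """Hypothetical parser: -i is optional so that '-l <lineage>' alone can be detected."""
    parser = argparse.ArgumentParser(prog="busco-sketch")
    parser.add_argument("-l", dest="lineage")
    parser.add_argument("-i", dest="in_file")
    parser.add_argument("-o", dest="out")
    return parser.parse_args(argv)


def is_download_only_request(args):
    """True when a lineage was named but no input or output was given."""
    return args.lineage is not None and args.in_file is None and args.out is None


args = parse_args(["-l", "aves"])
if is_download_only_request(args):
    print("No input file given. Download lineage '{}' only?".format(args.lineage))
```

With a check like this in front of the config validation, the user would see one question instead of the configparser traceback.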
(3a) In the tetrapoda lineage there is a softlink to a file in its originating environment that does not exist in the download:
tetrapoda_odb10 -> ../../ALL_ANCESTRAL_REWORK/tetrapoda_odb10
(3b) Also, I get this warning on the tetrapoda lineage set. Is this something to worry about?
WARNING:busco.BuscoTools The BUSCO ID(s) ['180491at32523', '119144at32523', '265941at32523'] were not found in the file ancestral_variants
(3c) With DEBUG logging on, my busco.log file is 9.6M bytes and 32218 lines; only 186 of those 32218 lines do not begin with DEBUG.
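For anyone wanting to reproduce the count in (3c), a small sketch that tallies lines by whether they begin with DEBUG. The sample lines are made up for illustration and are not taken from a real busco.log.

```python
def count_debug_lines(lines):
    """Return (total, non_debug) for an iterable of log lines."""
    total = 0
    non_debug = 0
    for line in lines:
        total += 1
        if not line.startswith("DEBUG"):
            non_debug += 1
    return total, non_debug


# Made-up sample lines, only to exercise the function.
sample = [
    "DEBUG:busco.Example\tsome detail\n",
    "WARNING:busco.BuscoTools\tsomething to look at\n",
    "DEBUG:busco.Example\tmore detail\n",
]
print(count_debug_lines(sample))
```

On a real log it can be called as `count_debug_lines(open("busco.log"))`, since a file object iterates line by line.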
(4) Problem with the v10 lineage urls:
https://www.orthodb.org/?query=66at32523 **works**
https://www.orthodb.org/v9/index.html?query=EOG090703OI **works**
https://www.orthodb.org/v10/index.html?query=66at32523 **does not work**
It is the last format, however, that is in all of the files, e.g. the OrthoDB url field in full_table.tsv.
If you can create a v10 directory underneath the orthodb.org main dir and, in this dir, softlink to ../index.html, this might work. Keeping the v10 separation makes sense: if orthodb.org/?query always goes to the current version, then the urls will need the v10 distinction once v11 is introduced.
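Until the server side is sorted out, a user-side stopgap is to rewrite the v10 urls into the form that works. A minimal sketch, assuming only the three url shapes listed above; `drop_version_path` is a hypothetical helper name. As noted, the unversioned form presumably follows the current OrthoDB version, so this is only safe while v10 is current.

```python
from urllib.parse import urlsplit, urlunsplit


def drop_version_path(url):
    """Rewrite .../v10/index.html?query=ID to .../?query=ID; leave other urls unchanged."""
    parts = urlsplit(url)
    if parts.netloc == "www.orthodb.org" and parts.path == "/v10/index.html":
        parts = parts._replace(path="/")
    return urlunsplit(parts)


print(drop_version_path("https://www.orthodb.org/v10/index.html?query=66at32523"))
```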
(5) The number_of_species line and the number_of_BUSCOs line in the lineage dataset.cfg are often reversed. These numbers are carried along in several of the reports, I believe, so it is worth fixing them. Here is a shell script with awk that I used to fix that where appropriate on my downloads:
$ cat chk_species_busco.sh
#!/bin/bash
# Fix dataset.cfg value mix-ups: swap the two counts if number_of_BUSCOs is less than number_of_species
dataset=${1:-dataset.cfg}    # default to dataset.cfg in the current directory
rm -f dtmp
awk -F"=" '
    {contents[++ln]=$0}      # remember every line so the file can be rewritten
    !/^number/{next}
    /BUSCO/  {busco_ln=ln; busco_str=$1; busco_val=$2; next}
    /species/{species_ln=ln; species_str=$1; species_val=$2}
    END {
        if (busco_val+0 < species_val+0) {    # +0 forces a numeric comparison
            for (i=1; i<=ln; i++)
                if (i == busco_ln)
                    print busco_str "=" species_val >"dtmp"
                else if (i == species_ln)
                    print species_str "=" busco_val >"dtmp"
                else
                    print contents[i] >"dtmp"
        }
        else print "Nothing to do" > "/dev/stderr"
    } ' "$dataset"
[ -f dtmp ] && mv dtmp "$dataset"
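If the swap were ever done inside the Python code base instead, the same heuristic could look roughly like this. It is a sketch under the same assumption as the script above (there should be more BUSCOs than species); `fix_counts` is a hypothetical name, and the numbers in the example are invented.

```python
def fix_counts(text):
    """Swap the number_of_BUSCOs and number_of_species values if BUSCOs < species."""
    lines = text.splitlines()
    found = {}
    for index, line in enumerate(lines):
        key, sep, value = line.partition("=")
        if sep and key.strip() in ("number_of_BUSCOs", "number_of_species"):
            found[key.strip()] = (index, key, value.strip())
    if len(found) < 2:
        return text  # one of the two lines is missing: nothing to do
    b_index, b_key, b_val = found["number_of_BUSCOs"]
    s_index, s_key, s_val = found["number_of_species"]
    if int(b_val) >= int(s_val):
        return text  # already in the expected order
    lines[b_index] = "{}={}".format(b_key, s_val)
    lines[s_index] = "{}={}".format(s_key, b_val)
    return "\n".join(lines) + ("\n" if text.endswith("\n") else "")


# Invented values, only to show the swap.
print(fix_counts("number_of_BUSCOs=3\nnumber_of_species=10\n"))
```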