lineage items

The new lineage integration is very nice. I have some suggestions based on issues I ran into as a first-time user:

(1) Misspelling the -l lineage name, e.g. -l avess, returns a confusing and overwhelming error message caused by a KeyError on line 103 of BuscoDownloadManager.py: latest_update = type(self).version_files[data_basename][0]

Perhaps enhance the original lines 149-150 of the get(self, data_name, category) routine in BuscoDownloadManager.py:

    data_basename = os.path.basename(data_name)
    local_filepath = os.path.join(self.local_download_path, category, data_basename)

with something like:

    data_basename = os.path.basename(data_name)
    if data_basename not in type(self).version_files:  # check for valid lineage name
        raise SystemExit("{} is not a valid lineage name\n".format(data_basename))

    local_filepath = os.path.join(self.local_download_path, category, data_basename)
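
Going one step further than the check above, the message could also suggest near matches for a mistyped name. This is only a minimal sketch: it assumes version_files is a dict keyed by dataset name (as the KeyError implies), and check_lineage_name is a hypothetical helper, not part of BUSCO:

```python
import difflib


def check_lineage_name(data_basename, version_files):
    """Return data_basename if it is a known dataset, else exit with a short message.

    version_files is assumed to be a dict keyed by dataset name.
    This helper is hypothetical and not part of BUSCO.
    """
    if data_basename in version_files:
        return data_basename
    # Offer up to three near matches so that a typo points at the intended name.
    suggestions = difflib.get_close_matches(
        data_basename, list(version_files), n=3, cutoff=0.6
    )
    hint = " Did you mean: {}?".format(", ".join(suggestions)) if suggestions else ""
    raise SystemExit("{} is not a valid lineage name.{}".format(data_basename, hint))
```

Raising SystemExit with a string prints just that string and exits with a non-zero status, so the user sees one line instead of a traceback.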

(2) I was hoping to download the lineages using the built-in procedure without actually performing the BUSCO assembly run. I tried busco -l aves, which gives lots of errors about -i being missing. Unfortunately, with configparser also throwing a KeyError because of the ;in= line in the cfg file, it is a lot to parse.

Anyway, I found this worked nicely: making an empty file, empty.fa, and executing the following:

busco -l aves -i empty.fa -o test -f

A download-only mode would still be a nice feature, perhaps asking the user whether that was their intent when the only argument is -l <lineage>.

(3a) In the tetrapoda lineage there is a softlink whose target exists only in its originating environment, not in the download:

tetrapoda_odb10 -> ../../ALL_ANCESTRAL_REWORK/tetrapoda_odb10
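
For anyone who wants to check their own downloads for dangling links like this one, here is a small sketch. find_broken_symlinks is a hypothetical helper name, and the root directory passed in is whatever your local download path happens to be:

```python
import os


def find_broken_symlinks(root):
    """Return a sorted list of (link_path, link_target) for dangling symlinks under root."""
    broken = []
    # os.walk does not follow symlinks by default, so each link is only inspected, not entered.
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            # islink is true for any symlink; exists is false when its target is missing.
            if os.path.islink(path) and not os.path.exists(path):
                broken.append((path, os.readlink(path)))
    return sorted(broken)
```

Calling find_broken_symlinks on the lineage download directory should list the tetrapoda link together with its missing target.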

(3b) I also get this warning on the tetrapoda lineage set. Is this something to worry about?

 WARNING:busco.BuscoTools        The BUSCO ID(s) ['180491at32523', '119144at32523', '265941at32523'] were not found in the file ancestral_variants

(3c) With DEBUG logging on, my busco.log file is 9.6 MB and 32218 lines long; only 186 of those 32218 lines do not begin with DEBUG.
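
The line counts above are easy to reproduce. A minimal sketch (count_debug_lines is a hypothetical helper; it only assumes that debug records start with the literal text DEBUG, as they do in my log):

```python
def count_debug_lines(lines):
    """Return (total, debug) where debug counts the lines that start with 'DEBUG'."""
    total = 0
    debug = 0
    for line in lines:
        total += 1
        if line.startswith("DEBUG"):
            debug += 1
    return total, debug


# Example use on a log file:
#     with open("busco.log") as fh:
#         total, debug = count_debug_lines(fh)
#     print(total - debug, "of", total, "lines do not begin with DEBUG")
```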

(4) Problem with the v10 lineage urls:

https://www.orthodb.org/?query=66at32523 **works**
https://www.orthodb.org/v9/index.html?query=EOG090703OI **works**

https://www.orthodb.org/v10/index.html?query=66at32523 **does not work**

It is the last format, however, that is in all of the files, e.g. in the OrthoDB url field in full_table.tsv.

If you can create a v10 directory underneath the orthodb.org main dir and, in that dir, softlink index.html to ../index.html, this might work. Keeping the v10 separation makes sense: if orthodb.org/?query always goes to the current version, then the urls will need the v10 distinction once v11 is introduced.
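
Until the server side is sorted out, a user-side workaround is to rewrite the v10 urls into the unversioned form that works today. This is only a sketch: to_unversioned_orthodb_url is a hypothetical helper, and it deliberately drops the version distinction, so it would stop being correct once v11 becomes the current version:

```python
from urllib.parse import urlsplit, urlunsplit


def to_unversioned_orthodb_url(url):
    """Rewrite https://www.orthodb.org/v10/index.html?query=X to https://www.orthodb.org/?query=X.

    Any other url (including the working v9 form) is returned unchanged.
    """
    parts = urlsplit(url)
    if parts.netloc == "www.orthodb.org" and parts.path == "/v10/index.html":
        # Keep scheme, host and query string; only the path changes.
        parts = parts._replace(path="/")
    return urlunsplit(parts)
```

Applied to the failing example above, this yields the first url in the list, the one that works.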

(5) The values on the number_of_species line and the number_of_BUSCOs line in the lineage dataset.cfg are often reversed. I believe these numbers are carried along in several of the reports, so it is worth fixing them. Here is a shell script with awk that I used to fix that where appropriate on my downloads:

$ cat chk_species_busco.sh

#!/bin/bash
# Fix dataset.cfg value mix-ups: swap the two counts if number_of_BUSCOs is less than number_of_species

dataset=$1
# quote the variable so the test also works for an empty value or a path with spaces
[ -z "$dataset" ] && dataset=dataset.cfg

rm -f dtmp
awk -F"=" '

   {contents[++ln]=$0}
   !/^number/{next}

   /BUSCO/  {busco_ln=ln;   busco_str=$1;   busco_val=$2; next}
   /species/{species_ln=ln; species_str=$1; species_val=$2}

   END {
      # +0 forces a numeric comparison rather than a string comparison
      if (busco_val+0 < species_val+0) {
         for(i=1; i<=ln;i++)
            if(i == busco_ln)
               print busco_str "=" species_val >"dtmp"
            else if (i == species_ln)
               print species_str "=" busco_val >"dtmp"
            else
               print contents[i] >"dtmp"
      }
      else print "Nothing to do" > "/dev/stderr"
 } ' "$dataset"

# dtmp exists only if a swap was needed
[ -f dtmp ] && mv dtmp "$dataset"
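
In case a Python version is more convenient to fold into the download step, here is a rough equivalent of the script above. It is a sketch, not a drop-in fix: swap_if_reversed is a hypothetical helper, and unlike the awk patterns it matches the two key names exactly:

```python
def swap_if_reversed(lines):
    """Swap the values of number_of_BUSCOs and number_of_species if BUSCOs < species.

    lines is a list of dataset.cfg lines without trailing newlines.
    Returns (new_lines, changed).
    """
    found = {}
    for i, line in enumerate(lines):
        key, sep, value = line.partition("=")
        if sep and key.strip() in ("number_of_BUSCOs", "number_of_species"):
            found[key.strip()] = (i, key, value.strip())
    if len(found) < 2:
        # One or both lines are missing, so there is nothing to compare.
        return list(lines), False
    busco_i, busco_key, busco_val = found["number_of_BUSCOs"]
    species_i, species_key, species_val = found["number_of_species"]
    if int(busco_val) >= int(species_val):
        return list(lines), False
    out = list(lines)
    out[busco_i] = "{}={}".format(busco_key, species_val)
    out[species_i] = "{}={}".format(species_key, busco_val)
    return out, True


# Example use on a file:
#     with open("dataset.cfg") as fh:
#         new_lines, changed = swap_if_reversed(fh.read().splitlines())
#     if changed:
#         with open("dataset.cfg", "w") as fh:
#             fh.write("\n".join(new_lines) + "\n")
```

Like the shell script, this relies on the heuristic that a lineage has more BUSCOs than species, so it would wrongly swap a dataset where that is not true.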