BUSCO results for the same sequence differ depending on context
Hello,
We noticed that the results in the 'full table' output of BUSCO differ depending on whether other sequences are present in the input data. More specifically, we conducted the following experiment.
We ran BUSCO twice in 'genome' mode with the plants BUSCO set. In the first run the input FASTA contained a single sequence record. The results look like this:
EOG09360916  Complete  scaffold32  301722   307462   633.1  317
EOG093605DD  Complete  scaffold32  1226342  1233987  374.9  687
EOG09360D08  Complete  scaffold32  1226342  1233987  378.6  706
EOG09360DJL  Complete  scaffold32  1226342  1233987  341.4  788
In the second run, the input included the same sequence record along with multiple other records. This time the results (for the same 4 BUSCOs) look like this:
EOG09360916  Complete  scaffold32   301722    307462    633.1  317
EOG093605DD  Complete  scaffold162  2138123   2141851   775.0  432
EOG09360D08  Complete  scaffold648  6325439   6327663   271.0  276
EOG09360DJL  Complete  scaffold689  58688227  58690205  555.9  299
It is unclear to us why the context in which the sequence record appears affects the result in this way. For instance, if BUSCO EOG093605DD is found both on scaffold32 and on scaffold162, why is it not listed as 'Duplicated'? Another concern: EOG09360D08 appears to be better represented on scaffold32 than on scaffold648 (higher score and longer alignment length), so why was the mapping to scaffold648 preferred in the second run?
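For anyone who wants to reproduce the comparison, here is a minimal Python sketch that lists the BUSCOs whose contig assignment differs between two full tables. It assumes the full table is tab-separated with columns in the order shown above (BUSCO id, status, contig, start, end, score, length) and that comment lines start with '#'. The function names are hypothetical helpers, not part of BUSCO, and the inline data simply repeats the rows quoted in this post.

```python
import csv
import io

# Assumed column order of the BUSCO full table (as in the rows quoted above).
FIELDS = ["busco_id", "status", "contig", "start", "end", "score", "length"]


def read_full_table(handle):
    """Map each BUSCO id to the list of rows reported for it."""
    hits = {}
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment/header lines
        # Rows with fewer columns (e.g. no hit) simply get fewer keys.
        record = dict(zip(FIELDS, row))
        hits.setdefault(record["busco_id"], []).append(record)
    return hits


def moved_buscos(run_a, run_b):
    """Return {busco_id: (contigs_in_a, contigs_in_b)} where the contigs differ."""
    moved = {}
    for busco_id in sorted(run_a.keys() & run_b.keys()):
        contigs_a = sorted({r["contig"] for r in run_a[busco_id] if "contig" in r})
        contigs_b = sorted({r["contig"] for r in run_b[busco_id] if "contig" in r})
        if contigs_a != contigs_b:
            moved[busco_id] = (contigs_a, contigs_b)
    return moved


# Rows copied from the two runs described in this post.
SINGLE_RECORD_RUN = (
    "EOG09360916\tComplete\tscaffold32\t301722\t307462\t633.1\t317\n"
    "EOG093605DD\tComplete\tscaffold32\t1226342\t1233987\t374.9\t687\n"
    "EOG09360D08\tComplete\tscaffold32\t1226342\t1233987\t378.6\t706\n"
    "EOG09360DJL\tComplete\tscaffold32\t1226342\t1233987\t341.4\t788\n"
)
MULTI_RECORD_RUN = (
    "EOG09360916\tComplete\tscaffold32\t301722\t307462\t633.1\t317\n"
    "EOG093605DD\tComplete\tscaffold162\t2138123\t2141851\t775.0\t432\n"
    "EOG09360D08\tComplete\tscaffold648\t6325439\t6327663\t271.0\t276\n"
    "EOG09360DJL\tComplete\tscaffold689\t58688227\t58690205\t555.9\t299\n"
)

single = read_full_table(io.StringIO(SINGLE_RECORD_RUN))
multi = read_full_table(io.StringIO(MULTI_RECORD_RUN))
moved = moved_buscos(single, multi)

for busco_id, (before, after) in moved.items():
    print(busco_id, ",".join(before), "->", ",".join(after))
```

With real output files, the two `io.StringIO(...)` objects would be replaced by file handles opened on the respective full table files.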