BUSCO reports more missing genes after scaffolding.
I scaffolded my genome assembly. I checked that the sequence itself did not change between the two versions: the only difference is that some contigs have been joined by a stretch of "N". I ran BUSCO on the assembly before and after scaffolding and found more missing genes after scaffolding, which I don't understand. I would expect this number to be exactly the same.
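A check of this kind can be sketched in Python as follows. This is only an illustration, not necessarily how the check above was done: it assumes the scaffolder may reverse-complement contigs, that gaps are written as runs of "N", and that both files fit in memory. The function names (`read_fasta`, `same_sequence`, and so on) are made up for the example.

```python
import re

def read_fasta(text):
    """Parse FASTA text into a dict of {record name: uppercase sequence}."""
    records, name, chunks = {}, None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records[name] = "".join(chunks)
            name, chunks = line[1:].split()[0], []
        else:
            chunks.append(line.upper())
    if name is not None:
        records[name] = "".join(chunks)
    return records

def revcomp(seq):
    """Reverse complement; characters other than A, C, G, T, N are left as-is."""
    return seq.translate(str.maketrans("ACGTN", "TGCAN"))[::-1]

def canonical(seq):
    """Orientation-independent form: the smaller of a sequence and its reverse complement."""
    return min(seq, revcomp(seq))

def pieces(records):
    """Split every record at runs of N and return the sorted canonical pieces."""
    out = []
    for seq in records.values():
        out.extend(canonical(part) for part in re.split(r"N+", seq) if part)
    return sorted(out)

def same_sequence(contigs_fasta, scaffolds_fasta):
    """True if both assemblies contain the same gap-free pieces (assumed check)."""
    return pieces(read_fasta(contigs_fasta)) == pieces(read_fasta(scaffolds_fasta))
```

With real files, the two arguments would be the contents of the contig and scaffold FASTA files, read with `open(path).read()`.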
In my log, the first pass looks fine:
Before scaffolding:
INFO ****** Step 3/3, current time: 02/16/2018 16:13:09 ******
INFO Running HMMER to confirm orthology of predicted proteins:
INFO 02/16/2018 16:13:10 => 0% of predictions performed (2228 to be done)
INFO 02/16/2018 16:13:12 => 10% of predictions performed (249/2228 candidate proteins)
INFO 02/16/2018 16:13:14 => 20% of predictions performed (469/2228 candidate proteins)
INFO 02/16/2018 16:13:16 => 30% of predictions performed (691/2228 candidate proteins)
INFO 02/16/2018 16:13:19 => 40% of predictions performed (914/2228 candidate proteins)
INFO 02/16/2018 16:13:23 => 50% of predictions performed (1137/2228 candidate proteins)
INFO 02/16/2018 16:13:28 => 60% of predictions performed (1365/2228 candidate proteins)
INFO 02/16/2018 16:13:33 => 70% of predictions performed (1582/2228 candidate proteins)
INFO 02/16/2018 16:13:39 => 80% of predictions performed (1806/2228 candidate proteins)
INFO 02/16/2018 16:13:46 => 90% of predictions performed (2030/2228 candidate proteins)
INFO 02/16/2018 16:13:53 => 100% of predictions performed
INFO Results:
INFO **C:93.8%[S:91.9%,D:1.9%],F:2.5%,M:3.7%,n:1440**
INFO 1351 Complete BUSCOs (C)
INFO 1324 Complete and single-copy BUSCOs (S)
INFO 27 Complete and duplicated BUSCOs (D)
INFO 36 Fragmented BUSCOs (F)
INFO 53 Missing BUSCOs (M)
INFO 1440 Total BUSCO groups searched
After scaffolding:
INFO ****** Step 3/3, current time: 02/16/2018 16:54:32 ******
INFO Running HMMER to confirm orthology of predicted proteins:
INFO 02/16/2018 16:54:33 => 0% of predictions performed (2225 to be done)
INFO 02/16/2018 16:54:36 => 10% of predictions performed (246/2225 candidate proteins)
INFO 02/16/2018 16:54:38 => 20% of predictions performed (469/2225 candidate proteins)
INFO 02/16/2018 16:54:40 => 30% of predictions performed (690/2225 candidate proteins)
INFO 02/16/2018 16:54:43 => 40% of predictions performed (916/2225 candidate proteins)
INFO 02/16/2018 16:54:47 => 50% of predictions performed (1135/2225 candidate proteins)
INFO 02/16/2018 16:54:52 => 60% of predictions performed (1361/2225 candidate proteins)
INFO 02/16/2018 16:54:57 => 70% of predictions performed (1581/2225 candidate proteins)
INFO 02/16/2018 16:55:03 => 80% of predictions performed (1804/2225 candidate proteins)
INFO 02/16/2018 16:55:10 => 90% of predictions performed (2028/2225 candidate proteins)
INFO 02/16/2018 16:55:16 => 100% of predictions performed
INFO Results:
INFO **C:94.0%[S:92.2%,D:1.8%],F:2.2%,M:3.8%,n:1440**
INFO 1353 Complete BUSCOs (C)
INFO 1327 Complete and single-copy BUSCOs (S)
INFO 26 Complete and duplicated BUSCOs (D)
INFO 32 Fragmented BUSCOs (F)
INFO 55 Missing BUSCOs (M)
INFO 1440 Total BUSCO groups searched
But the final results differ between the two assemblies:
Before scaffolding:
INFO ****** Step 3/3, current time: 02/16/2018 16:33:58 ******
INFO Running HMMER to confirm orthology of predicted proteins:
INFO 02/16/2018 16:33:58 => 0% of predictions performed (158 to be done)
INFO 02/16/2018 16:33:58 => 10% of predictions performed (21/158 candidate proteins)
INFO 02/16/2018 16:33:58 => 20% of predictions performed (34/158 candidate proteins)
INFO 02/16/2018 16:33:58 => 30% of predictions performed (49/158 candidate proteins)
INFO 02/16/2018 16:33:58 => 40% of predictions performed (70/158 candidate proteins)
INFO 02/16/2018 16:33:58 => 50% of predictions performed (81/158 candidate proteins)
INFO 02/16/2018 16:33:58 => 60% of predictions performed (97/158 candidate proteins)
INFO 02/16/2018 16:33:58 => 70% of predictions performed (113/158 candidate proteins)
INFO 02/16/2018 16:33:58 => 80% of predictions performed (129/158 candidate proteins)
INFO 02/16/2018 16:33:58 => 90% of predictions performed (144/158 candidate proteins)
INFO 02/16/2018 16:33:58 => 100% of predictions performed
INFO Results:
INFO **C:95.6%[S:93.6%,D:2.0%],F:1.8%,M:2.6%,n:1440**
INFO 1377 Complete BUSCOs (C)
INFO 1348 Complete and single-copy BUSCOs (S)
INFO 29 Complete and duplicated BUSCOs (D)
INFO 26 Fragmented BUSCOs (F)
INFO 37 Missing BUSCOs (M)
INFO 1440 Total BUSCO groups searched
After scaffolding:
INFO ****** Step 3/3, current time: 02/16/2018 17:14:17 ******
INFO Running HMMER to confirm orthology of predicted proteins:
INFO 02/16/2018 17:14:17 => 0% of predictions performed (0 to be done)
INFO 02/16/2018 17:14:17 => 100% of predictions performed
INFO Results:
INFO **C:94.0%[S:92.2%,D:1.8%],F:2.2%,M:3.8%,n:1440**
INFO 1353 Complete BUSCOs (C)
INFO 1327 Complete and single-copy BUSCOs (S)
INFO 26 Complete and duplicated BUSCOs (D)
INFO 32 Fragmented BUSCOs (F)
INFO 55 Missing BUSCOs (M)
INFO 1440 Total BUSCO groups searched
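To make the gap between the two final results explicit, the count lines can be parsed and subtracted. The sketch below is only an illustration: the regular expression is an assumption based on the log lines pasted above, and `parse_counts` and `diff_counts` are made-up helper names. Note also that the final counts for the scaffolded assembly are identical to its first-pass counts, which is consistent with no proteins reaching the last HMMER step.

```python
import re

# Matches lines such as "INFO 1353 Complete BUSCOs (C)" (format assumed from the log above).
COUNT_LINE = re.compile(r"INFO\s+(\d+)\s+.*\(([CSDFM])\)\s*$")

def parse_counts(log_text):
    """Return {category letter: count} for the C, S, D, F, M lines in a log excerpt."""
    counts = {}
    for line in log_text.splitlines():
        match = COUNT_LINE.search(line)
        if match:
            counts[match.group(2)] = int(match.group(1))
    return counts

def diff_counts(before, after):
    """Per-category change, computed as after minus before."""
    return {key: after[key] - before[key] for key in before}

# Final results copied from the two logs above.
FINAL_BEFORE = """\
INFO 1377 Complete BUSCOs (C)
INFO 1348 Complete and single-copy BUSCOs (S)
INFO 29 Complete and duplicated BUSCOs (D)
INFO 26 Fragmented BUSCOs (F)
INFO 37 Missing BUSCOs (M)
INFO 1440 Total BUSCO groups searched
"""

FINAL_AFTER = """\
INFO 1353 Complete BUSCOs (C)
INFO 1327 Complete and single-copy BUSCOs (S)
INFO 26 Complete and duplicated BUSCOs (D)
INFO 32 Fragmented BUSCOs (F)
INFO 55 Missing BUSCOs (M)
INFO 1440 Total BUSCO groups searched
"""

if __name__ == "__main__":
    change = diff_counts(parse_counts(FINAL_BEFORE), parse_counts(FINAL_AFTER))
    for category in "CSDFM":
        print(category, change[category])
```

On these numbers the scaffolded assembly ends with 24 fewer complete, 6 more fragmented and 18 more missing BUSCOs than the unscaffolded one.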
I don't understand why, at this last step, the scaffolded assembly shows "0 to be done", because in step 2 of the second phase BUSCO does find candidate regions (156 to be done):
INFO ****** Step 2/3, current time: 02/16/2018 17:03:04 ******
INFO Training Augustus using Single-Copy Complete BUSCOs:
INFO 02/16/2018 17:03:04 => Converting predicted genes to short genbank files...
INFO 02/16/2018 17:14:12 => All files converted to short genbank files, now running the training scripts...
INFO Pre-Augustus scaffold extraction...
INFO Re-running Augustus with the new metaparameters, number of target BUSCOs: 87
INFO 02/16/2018 17:14:14 => 0% of predictions performed (156 to be done)
INFO 02/16/2018 17:14:14 => 10% of predictions performed (21/156 candidate regions)
INFO 02/16/2018 17:14:14 => 20% of predictions performed (36/156 candidate regions)
INFO 02/16/2018 17:14:14 => 30% of predictions performed (49/156 candidate regions)
INFO 02/16/2018 17:14:15 => 40% of predictions performed (64/156 candidate regions)
INFO 02/16/2018 17:14:15 => 50% of predictions performed (81/156 candidate regions)
INFO 02/16/2018 17:14:15 => 60% of predictions performed (96/156 candidate regions)
INFO 02/16/2018 17:14:15 => 70% of predictions performed (111/156 candidate regions)
INFO 02/16/2018 17:14:16 => 80% of predictions performed (128/156 candidate regions)
INFO 02/16/2018 17:14:16 => 90% of predictions performed (142/156 candidate regions)
INFO 02/16/2018 17:14:16 => 100% of predictions performed
INFO Extracting predicted proteins...
Do you know why I am getting this difference? Is it a problem with BUSCO, which fails to identify the genes correctly in my assembly after scaffolding? Or is it a problem with the scaffolding step itself?
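One way to narrow this down would be to list which BUSCOs change status between the two runs. The sketch below assumes the full table that BUSCO writes is tab-separated, with "#" comment lines and the BUSCO ID and status in the first two columns. That layout is an assumption to check against the actual output files, and the helper names are invented for the example.

```python
def read_status(full_table_text):
    """Map each BUSCO ID to its status, skipping comment and blank lines."""
    status = {}
    for line in full_table_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) >= 2:
            # Duplicated BUSCOs span several rows with the same ID and status.
            status[fields[0]] = fields[1]
    return status

def changed_status(before_text, after_text):
    """Return {BUSCO ID: (status before, status after)} for IDs whose status differs."""
    before = read_status(before_text)
    after = read_status(after_text)
    return {
        busco_id: (before[busco_id], after.get(busco_id, "Absent"))
        for busco_id in sorted(before)
        if before[busco_id] != after.get(busco_id, "Absent")
    }
```

Calling `changed_status` on the contents of the two full tables would give the IDs that went from Complete to Fragmented or Missing, which can then be looked up on the scaffolds to see whether they sit next to an inserted gap.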
Best, Amandine