too many ancestral_variants_missing_and_frag_rerun records created (glires lineage)
You've probably already got this one fixed in 4.1, but passing along in case it is new:
At least for glires, phase 2 creates too many records in ancestral_variants_missing_and_frag_rerun. I thought it had something to do with short IDs, e.g. 1at314147 matching longer ones like 21at314147, but the pattern didn't hold.
The Phase 1 run had 444 Missing and 101 Fragmented BUSCOs. busco.log says 545 records with their variants are to be extracted; however, the total number of records in ancestral_variants_missing_and_frag_rerun is 29,041, referencing variants of 3,876 original records. It also looks like the phase 2 Augustus run ends early.
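One way to test the substring idea directly is sketched below. It assumes two plain lists of IDs, one per line, which are not files BUSCO produces: the expected IDs (Missing plus Fragmented from Phase 1) and the IDs actually seen in the rerun file. The function name is made up for illustration.

```shell
# Print every candidate ID that contains one of the expected IDs as a proper
# substring (for example, 21at314147 contains 1at314147).
# $1: file of expected IDs, one per line (assumed: Missing + Fragmented)
# $2: file of candidate IDs, one per line (assumed: IDs seen in the rerun file)
substring_hits() {
    awk '
        NF == 0   { next }                       # skip blank lines
        NR == FNR { expected[$1] = 1; next }     # first file: remember expected IDs
        {
            for (id in expected)
                if ($1 != id && index($1, id) > 0) {
                    print $1 "\tcontains\t" id   # report the first match found
                    break
                }
        }
    ' "$1" "$2"
}
```

If most of the unexpected IDs show up in this output, loose substring matching would be a plausible cause; if only a few do, something else is going on.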
.../run_glires_odb10/blast_output$ grep "^>" ancestral_variants_missing_and_frag_rerun -c
29041
...run_glires_odb10/blast_output$ grep "^>" ancestral_variants_missing_and_frag_rerun | cut -f1 -d" " |uniq |wc -l
3876
With any single original record having no more than 10 variants (the average in the glires file is closer to 7), we shouldn't have more than 545*10 = 5,450 records in total instead of the 29,041.
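The two grep counts above and the per-record maximum can be collected in one pass. This is a rough sketch, assuming the BUSCO ID is the first whitespace-delimited token of each header, optionally followed by a _N variant suffix; the function name is made up for illustration.

```shell
# Summarise a FASTA file of variants: total records, distinct IDs, and the
# largest number of variants seen for a single ID.
# Assumption: each header starts with ">ID" or ">ID_N", then optional text.
variant_stats() {
    grep '^>' "$1" \
      | awk '{
                 id = substr($1, 2)            # drop the leading ">"
                 sub(/_[0-9]+$/, "", id)       # drop a trailing _N suffix, if any
                 n[id]++
             }
             END {
                 for (id in n) {
                     ids++
                     total += n[id]
                     if (n[id] > max) max = n[id]
                 }
                 printf "records=%d ids=%d max_per_id=%d\n", total, ids, max
             }'
}
```

Called as `variant_stats ancestral_variants_missing_and_frag_rerun`, the `records` value can be compared against the bound of 545 times `max_per_id`.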
I saved the Phase 1 files in a dir, and the first 20 in Phase 1 full_table.tsv show:
.../run_glires_odb10/blast_output$ grep "[1-9]*at" ../Phase1/full_table.tsv | cut -f1-2 |head -20
1at314147 Fragmented
3at314147 Missing
9at314147 Complete
14at314147 Complete
18at314147 Complete
21at314147 Complete
22at314147 Complete
23at314147 Complete
27at314147 Missing
33at314147 Missing
37at314147 Complete
38at314147 Complete
42at314147 Complete
44at314147 Complete
45at314147 Missing
47at314147 Complete
49at314147 Complete
51at314147 Complete
56at314147 Complete
57at314147 Complete
Here are the first 20 record IDs, sorted by name, in ancestral_variants_missing_and_frag_rerun, with the count of variants (_0, _1, etc.) for each in the first column. I have put the Phase 1 status next to a few to show some of the issues: some that should be there, like 27at314147, are not, and many Complete and Duplicated ones are.
.../run_glires_odb10/blast_output$ grep "^>" ancestral_variants_missing_and_frag_rerun |cut -f1 -d" " |uniq -c |sort -V |head -20
10 >1at314147 # Fragmented
10 >3at314147 # Missing
10 >21at314147 # Complete
10 >23at314147 # Complete
10 >33at314147 # Missing
10 >77at314147
10 >83at314147
10 >103at314147
10 >121at314147
10 >151at314147
10 >171at314147
10 >173at314147
10 >192at314147
10 >207at314147
10 >224at314147
10 >227at314147
10 >252at314147
10 >281at314147
10 >291at314147
10 >321at314147
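Rather than annotating statuses by hand, the full comparison can be done with a small function like the one below. It is a sketch under two assumptions: full_table.tsv is tab-separated with the BUSCO ID in column 1 and the status in column 2, and the rerun file's headers start with the ID, optionally followed by a _N suffix. The function name and the unexpected/absent labels are made up for illustration.

```shell
# Compare the IDs in the rerun FASTA against the Missing/Fragmented IDs of Phase 1.
# $1: Phase 1 full_table.tsv   $2: ancestral_variants_missing_and_frag_rerun
# Runs in a subshell so the temporary variables do not leak into the caller.
compare_rerun_ids() (
    expected=$(mktemp) || exit 1
    found=$(mktemp) || exit 1
    # IDs whose Phase 1 status is Missing or Fragmented (comment lines skipped)
    awk -F'\t' '!/^#/ && ($2 == "Missing" || $2 == "Fragmented") { print $1 }' "$1" \
        | LC_ALL=C sort -u > "$expected"
    # IDs present in the rerun FASTA headers, variant suffix removed
    grep '^>' "$2" \
        | awk '{ id = substr($1, 2); sub(/_[0-9]+$/, "", id); print id }' \
        | LC_ALL=C sort -u > "$found"
    # In the rerun file but not Missing/Fragmented
    LC_ALL=C comm -13 "$expected" "$found" | awk '{ print "unexpected\t" $0 }'
    # Missing/Fragmented but not in the rerun file
    LC_ALL=C comm -23 "$expected" "$found" | awk '{ print "absent\t" $0 }'
    rm -f "$expected" "$found"
)
```

For example, `compare_rerun_ids ../Phase1/full_table.tsv ancestral_variants_missing_and_frag_rerun | cut -f1 | sort | uniq -c` would give the size of each group.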
best and thank you,
Jim Henderson
I've put the log below, filtered with grep -v "^DEBUG" busco.log:
INFO:busco.BuscoAnalysis Running BUSCO using lineage dataset glires_odb10 (eukaryota, 2019-12-17)
INFO:busco.BuscoTools Creating BLAST database with input file
INFO:busco.Toolset Running 1 job(s) on makeblastdb
INFO:busco.Toolset [makeblastdb] 1 of 1 task(s) completed
INFO:busco.BuscoTools Running a BLAST search for BUSCOs against created database
INFO:busco.Toolset Running 1 job(s) on tblastn
INFO:busco.Toolset [tblastn] 1 of 1 task(s) completed
INFO:busco.GenomeAnalysis Running Augustus gene predictor on BLAST search results.
INFO:busco.BuscoTools Running Augustus prediction using human as species:
INFO:busco.BuscoTools Additional parameters for Augustus are --singlestrand=true:
INFO:busco.Toolset Running 55578 job(s) on augustus
INFO:busco.Toolset [augustus] 5558 of 55578 task(s) completed
INFO:busco.Toolset [augustus] 11116 of 55578 task(s) completed
INFO:busco.Toolset [augustus] 16674 of 55578 task(s) completed
INFO:busco.Toolset [augustus] 22232 of 55578 task(s) completed
INFO:busco.Toolset [augustus] 27790 of 55578 task(s) completed
INFO:busco.Toolset [augustus] 33347 of 55578 task(s) completed
INFO:busco.Toolset [augustus] 38905 of 55578 task(s) completed
INFO:busco.Toolset [augustus] 44463 of 55578 task(s) completed
INFO:busco.Toolset [augustus] 50021 of 55578 task(s) completed
INFO:busco.Toolset [augustus] 55578 of 55578 task(s) completed
INFO:busco.BuscoTools Extracting predicted proteins...
INFO:busco.BuscoAnalysis ***** Run HMMER on gene sequences *****
INFO:busco.Toolset Running 54072 job(s) on hmmsearch
INFO:busco.Toolset [hmmsearch] 43258 of 54072 task(s) completed
INFO:busco.Toolset [hmmsearch] 54072 of 54072 task(s) completed
C:96.1%[S:93.4%,D:2.7%],F:0.7%,M:3.2%,n:13798
13253 Complete BUSCOs (C)
12886 Complete and single-copy BUSCOs (S)
367 Complete and duplicated BUSCOs (D)
101 Fragmented BUSCOs (F)
444 Missing BUSCOs (M)
13798 Total BUSCO groups searched
INFO:busco.GenomeAnalysis Starting second step of analysis. The gene predictor Augustus is retrained using the results from the initial run to yield more accurate results.
INFO:busco.BuscoTools Extracting missing and fragmented buscos from the file ancestral_variants...
WARNING:busco.BuscoTools The BUSCO ID(s) ['17816at314147', '39247at314147', '20456at314147', '45451at314147', '17531at314147', '5651at314147', '19604at314147', '7735at314147', '16561at314147', '18006at314147', '18792at314147', '16812at314147', '45368at314147', '31533at314147', '23875at314147', '45101at314147', '44935at314147', '41845at314147', '39838at314147', '17181at314147', '36941at314147'] were not found in the file ancestral_variants
INFO:busco.BuscoTools Running a BLAST search for BUSCOs against created database
INFO:busco.Toolset [tblastn] 1 of 1 task(s) completed
INFO:busco.GenomeAnalysis Training Augustus using Single-Copy Complete BUSCOs:
INFO:busco.GenomeAnalysis Converting predicted genes to short genbank files
INFO:busco.Toolset Running 12886 job(s) on gff2gbSmallDNA.pl
INFO:busco.Toolset [gff2gbSmallDNA.pl] 1289 of 12886 task(s) completed
INFO:busco.Toolset [gff2gbSmallDNA.pl] 2578 of 12886 task(s) completed
INFO:busco.Toolset [gff2gbSmallDNA.pl] 3866 of 12886 task(s) completed
INFO:busco.Toolset [gff2gbSmallDNA.pl] 5155 of 12886 task(s) completed
INFO:busco.Toolset [gff2gbSmallDNA.pl] 6443 of 12886 task(s) completed
INFO:busco.Toolset [gff2gbSmallDNA.pl] 7732 of 12886 task(s) completed
INFO:busco.Toolset [gff2gbSmallDNA.pl] 9021 of 12886 task(s) completed
INFO:busco.Toolset [gff2gbSmallDNA.pl] 10309 of 12886 task(s) completed
INFO:busco.Toolset [gff2gbSmallDNA.pl] 11598 of 12886 task(s) completed
INFO:busco.Toolset [gff2gbSmallDNA.pl] 12886 of 12886 task(s) completed
INFO:busco.GenomeAnalysis All files converted to short genbank files, now running the training scripts
INFO:busco.Toolset Running 1 job(s) on new_species.pl
INFO:busco.Toolset [new_species.pl] 1 of 1 task(s) completed
INFO:busco.Toolset Running 1 job(s) on etraining
INFO:busco.Toolset [etraining] 1 of 1 task(s) completed
INFO:busco.GenomeAnalysis Re-running Augustus with the new metaparameters, number of target BUSCOs: 545
INFO:busco.GenomeAnalysis Running Augustus gene predictor on BLAST search results.
INFO:busco.BuscoTools Running Augustus prediction using BUSCO_busco4_Nfusc_glires as species:
INFO:busco.BuscoTools Additional parameters for Augustus are --singlestrand=true:
INFO:busco.Toolset [augustus] 2450 of 3062 task(s) completed
INFO:busco.Toolset [augustus] 2756 of 3062 task(s) completed
INFO:busco.Toolset [augustus] 3062 of 3062 task(s) completed
INFO:busco.BuscoTools Extracting predicted proteins...
INFO:busco.BuscoAnalysis ***** Run HMMER on gene sequences *****
INFO:busco.BuscoAnalysis Results: C:96.1%[S:93.4%,D:2.7%],F:0.7%,M:3.2%,n:13798
INFO:busco.BuscoRunner
--------------------------------------------------
|Results from dataset glires_odb10 |
--------------------------------------------------
|C:96.1%[S:93.4%,D:2.7%],F:0.7%,M:3.2%,n:13798 |
|13253 Complete BUSCOs (C) |
|12886 Complete and single-copy BUSCOs (S) |
|367 Complete and duplicated BUSCOs (D) |
|101 Fragmented BUSCOs (F) |
|444 Missing BUSCOs (M) |
|13798 Total BUSCO groups searched |
--------------------------------------------------
INFO:busco.BuscoRunner BUSCO analysis done with WARNING(s). Total running time: 179663 seconds