BUSCO is excluding good matches
I have a de novo transcriptome assembly, generated using Trinity. I filtered my assembly as follows:
1. Filter for transcripts > 500 nt
2. Select only the longest isoform for each gene
3. Run TransDecoder.LongOrfs followed by TransDecoder.Predict to predict CDS and peptide sequences
I then ran BUSCO 5.0.0 on the output from steps 2 and 3 (Trinity_500_longestisoform.fasta and Trinity_500_longestisoform.fasta.transdecoder.cds, respectively) using the following command:
busco -i <output from step 2 or 3> -o <output basename>_BUSCO-euk --config path/to/config.ini --auto-lineage-euk -m transcriptome --update-data -c 8
I got the following results:
$ cat short_summary.generic.eukaryota_odb10.Trinity_500_longestisoform_BUSCO-euk.txt
C:89.8%[S:89.0%,D:0.8%],F:6.7%,M:3.5%,n:255
229 Complete BUSCOs (C)
227 Complete and single-copy BUSCOs (S)
2 Complete and duplicated BUSCOs (D)
17 Fragmented BUSCOs (F)
9 Missing BUSCOs (M)
255 Total BUSCO groups searched
$ cat short_summary.generic.eukaryota_odb10.Trinity_500_LI_LO_TdP_cds_BUSCO-euk.txt
C:52.2%[S:51.4%,D:0.8%],F:33.7%,M:14.1%,n:255
133 Complete BUSCOs (C)
131 Complete and single-copy BUSCOs (S)
2 Complete and duplicated BUSCOs (D)
86 Fragmented BUSCOs (F)
36 Missing BUSCOs (M)
255 Total BUSCO groups searched
It seems concerning that 'complete' BUSCOs drop by roughly 38 percentage points (from 89.8% to 52.2%) between the two files. Initially I thought there was a problem with TransDecoder, where good transcript sequences were being thrown out during CDS prediction. However, some digging revealed that the issue is with BUSCO. I generated a list of the BUSCOs marked as 'missing' in Trinity_500_longestisoform.fasta.transdecoder.cds but 'present' in Trinity_500_longestisoform.fasta:
$ head BUSCOs_missing_after_TransDecoder_fulllist.txt
325552at2759 Complete TRINITY_DN28172_c0_g1_i4:26-2103 428.3 456
331411at2759 Complete TRINITY_DN31207_c0_g1_i6:273-3076 316.0 397
345441at2759 Fragmented TRINITY_DN36570_c0_g1_i2:536-1528 296.7 247
388820at2759 Complete TRINITY_DN32644_c0_g1_i5:121-3078 540.1 503
604979at2759 Complete TRINITY_DN35968_c0_g1_i10:3-2091 431.5 284
674169at2759 Complete TRINITY_DN34634_c0_g1_i13:204-2719 155.2 247
679187at2759 Complete TRINITY_DN34722_c0_g1_i1:172-3261 320.7 384
679771at2759 Complete TRINITY_DN31155_c0_g1_i6:79-1581 102.9 177
905026at2759 Complete TRINITY_DN29245_c0_g1_i2:55-1705 350.9 403
939345at2759 Complete TRINITY_DN25147_c0_g1_i12:160-1616 82.4 113
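For anyone who wants to reproduce this kind of list, here is a minimal sketch of one way to build it by comparing the two BUSCO full tables. It is not necessarily how the list above was generated. It assumes each run wrote a tab-separated full_table.tsv whose comment lines start with `#` and whose first two columns are the BUSCO id and its status, as in the rows shown above. The function names and any file paths are hypothetical.

```python
import csv


def read_busco_table(path):
    """Map BUSCO id -> list of rows (a Duplicated BUSCO spans several rows).

    Assumes a tab-separated table with '#' comment lines, where column 1 is
    the BUSCO id and column 2 is the status.
    """
    table = {}
    with open(path, newline="") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            if not row or row[0].startswith("#"):
                continue  # skip blank lines and the commented header
            table.setdefault(row[0], []).append(row)
    return table


def lost_buscos(before_path, after_path):
    """Return rows from the 'before' table for BUSCOs that are found there
    (any status except Missing) but are Missing in the 'after' table."""
    before = read_busco_table(before_path)
    after = read_busco_table(after_path)
    lost = []
    for busco_id, rows in sorted(before.items()):
        status_before = rows[0][1]
        # An id absent from the 'after' table is treated as Missing.
        status_after = after.get(busco_id, [[busco_id, "Missing"]])[0][1]
        if status_before != "Missing" and status_after == "Missing":
            lost.extend(rows)
    return lost
```

Calling `lost_buscos()` with the full table from the transcript run first and the full table from the CDS run second returns the rows of the first table, so each entry keeps the transcript hit, score, and length for later inspection.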
When I searched for these sequences in Trinity_500_longestisoform.fasta.transdecoder.cds, I found that they are present:
$ grep 'TRINITY_DN28172_c0_g1_i4' Trinity_500_longestisoform.fasta.transdecoder.cds
TRINITY_DN28172_c0_g1_i4|m.9259 TRINITY_DN28172_c0_g1_i4|g.9259 ORF TRINITY_DN28172_c0_g1_i4|g.9259 TRINITY_DN28172_c0_g1_i4|m.9259 type:complete len:123 (+) TRINITY_DN28172_c0_g1_i4:913-1281(+)
TRINITY_DN28172_c0_g1_i4|m.9257 TRINITY_DN28172_c0_g1_i4|g.9257 ORF TRINITY_DN28172_c0_g1_i4|g.9257 TRINITY_DN28172_c0_g1_i4|m.9257 type:complete len:177 (+) TRINITY_DN28172_c0_g1_i4:1592-2122(+)
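To line these ORFs up against the transcript hit reported in the BUSCO table, the headers can be parsed into coordinates. The sketch below assumes the header layout shown in the grep output above (`type:`, `len:`, strand, then `transcript:start-end(strand)`), and it assumes that the `:26-2103` suffix in the BUSCO table is a 1-based, inclusive coordinate range on the same transcript. The function names are hypothetical.

```python
import re

# Matches headers like the two shown above; a leading '>' is optional.
HEADER_RE = re.compile(
    r"^>?(?P<orf_id>\S+).*?type:(?P<orf_type>\S+)\s+len:(?P<length>\d+)\s+"
    r"\((?P<strand>[+-])\)\s+(?P<transcript>\S+):(?P<start>\d+)-(?P<end>\d+)\([+-]\)"
)


def parse_orf_header(line):
    """Return a dict of ORF fields, or None if the line does not match."""
    match = HEADER_RE.search(line.strip())
    if match is None:
        return None
    fields = match.groupdict()
    for key in ("length", "start", "end"):
        fields[key] = int(fields[key])
    return fields


def overlap_length(a_start, a_end, b_start, b_end):
    """Number of positions shared by two 1-based, inclusive ranges."""
    return max(0, min(a_end, b_end) - max(a_start, b_start) + 1)
```

With the parsed `start` and `end` of each ORF, `overlap_length(start, end, 26, 2103)` gives the number of nucleotides that ORF shares with the region BUSCO matched on the unprocessed transcript.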
Further, when I BLAST-search some of these coding sequences, I get pretty good hits matching the BUSCO that they should belong to (in this case, 325552at2759):

Any ideas what's going on here? Thank you! Emily