Overlapping BUSCOs and discrepancy between gff/fasta and full_table.tsv
Hi, I have a question regarding overlapping BUSCO genes. I ran BUSCO on 12 genome assemblies to extract single-copy genes and their positions for phylogenetic and other downstream analyses. The assemblies are of very high quality, with 98-99% complete BUSCOs.

I found around 40 instances of different BUSCO genes clustering together as orthogroups in at least one taxon. When I looked closer, these genes have overlapping physical positions, which explains why their sequences align. I have included one example below: the fasta headers and the gene positions in the gff files for two overlapping genes. In the full_table.tsv, however, they do not overlap. In the example here, the interval in full_table.tsv for the second gene begins at the start of exon_4 and runs to the end of the gene, yet the gene is still labelled as Complete.

Is this normal BUSCO behaviour? I guess I could expect a number of overlapping genes, but how does BUSCO handle them? And do the values in full_table.tsv come from a different source of information than the gff and, subsequently, the output fasta? It would be nice to have some indication in full_table.tsv that these genes are overlapping or truncated. Or am I doing something wrong, or misinterpreting the output?

I would be very happy if you could help me understand how to handle these genes, and the discrepancy between the gff/fasta and the full_table.tsv.
Many thanks! /Karin
Fasta headers:
run_lepidoptera_odb10/busco_sequences/single_copy_busco_sequences/10646at7088.fna:
ilMelMeno1_1_14349046-14356502
run_lepidoptera_odb10/busco_sequences/single_copy_busco_sequences/3727at7088.fna:
ilMelMeno1_1_14349046-14364888
run_lepidoptera_odb10/full_table.tsv:
10646at7088 Complete ilMelMeno1_1 14349046 14356502 + 375.1 193 https://www.orthodb.org/v10?query=10646at7088 acyl-protein thioesterase 1
run_lepidoptera_odb10/full_table.tsv:
3727at7088 Complete ilMelMeno1_1 14360772 14364888 + 1211.7 532 https://www.orthodb.org/v10?query=3727at7088 sodium/hydrogen exchanger 8
run_lepidoptera_odb10/busco_sequences/single_copy_busco_sequences/10646at7088.gff:
ilMelMeno1_1 MetaEuk gene 14349047 14356503 446 + . Target_ID=10646at7088;TCS_ID=10646at7088|ilMelMeno1_1|+|14349046
run_lepidoptera_odb10/busco_sequences/single_copy_busco_sequences/3727at7088.gff:
ilMelMeno1_1 MetaEuk gene 14349047 14364889 1454 + . Target_ID=3727at7088_0;TCS_ID=3727at7088_0|ilMelMeno1_1|+|14349046
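The comparison above can be reproduced with a short script. The following is only a minimal sketch, not part of BUSCO: the helper names (`parse_gff_gene`, `parse_full_table`, `overlapping_pairs`) are hypothetical, it assumes whitespace-separated columns in the order shown in the lines above, and it assumes that a `Target_ID` suffix such as `_0` can be dropped to recover the BUSCO ID.

```python
from itertools import combinations
from typing import List, NamedTuple, Optional, Tuple


class Interval(NamedTuple):
    busco_id: str
    sequence: str
    start: int
    end: int


def parse_gff_gene(line: str) -> Optional[Interval]:
    """Parse a 'gene' line from a per-BUSCO gff file; return None for other lines."""
    fields = line.split()
    if line.startswith("#") or len(fields) < 9 or fields[2] != "gene":
        return None
    attrs = dict(item.split("=", 1) for item in fields[8].split(";") if "=" in item)
    # Assumption: "3727at7088_0" refers to BUSCO "3727at7088".
    busco_id = attrs.get("Target_ID", "").split("_")[0]
    return Interval(busco_id, fields[0], int(fields[3]), int(fields[4]))


def parse_full_table(line: str) -> Optional[Interval]:
    """Parse a full_table.tsv row; return None for comments and rows without coordinates."""
    fields = line.split()
    if line.startswith("#") or len(fields) < 5 or fields[1] == "Missing":
        return None
    return Interval(fields[0], fields[2], int(fields[3]), int(fields[4]))


def overlapping_pairs(intervals: List[Interval]) -> List[Tuple[str, str]]:
    """Return pairs of different BUSCO IDs whose intervals overlap on the same sequence."""
    pairs = []
    for a, b in combinations(intervals, 2):
        same_seq = a.sequence == b.sequence
        if a.busco_id != b.busco_id and same_seq and a.start <= b.end and b.start <= a.end:
            pairs.append((a.busco_id, b.busco_id))
    return pairs


# The example lines from this post.
gff_lines = [
    "ilMelMeno1_1 MetaEuk gene 14349047 14356503 446 + . "
    "Target_ID=10646at7088;TCS_ID=10646at7088|ilMelMeno1_1|+|14349046",
    "ilMelMeno1_1 MetaEuk gene 14349047 14364889 1454 + . "
    "Target_ID=3727at7088_0;TCS_ID=3727at7088_0|ilMelMeno1_1|+|14349046",
]
table_lines = [
    "10646at7088 Complete ilMelMeno1_1 14349046 14356502 + 375.1 193",
    "3727at7088 Complete ilMelMeno1_1 14360772 14364888 + 1211.7 532",
]

gff_intervals = [iv for iv in map(parse_gff_gene, gff_lines) if iv]
table_intervals = [iv for iv in map(parse_full_table, table_lines) if iv]

print("gff overlaps:", overlapping_pairs(gff_intervals))
print("full_table overlaps:", overlapping_pairs(table_intervals))
```

On the two example genes this reports one overlapping pair from the gff coordinates and none from the full_table.tsv coordinates, which is the discrepancy described above. For a real run, the lines would be read from the files under `busco_sequences/single_copy_busco_sequences/` and from `full_table.tsv` instead of being hard-coded.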
BUSCO version: busco/5.4.6 with the lepidoptera_odb10 lineage dataset
BUSCO command: busco -i $input_dir -l lepidoptera_odb10 -m geno -o $output_dir -c 12
In the log I found the following output, but I am not sure whether it refers to the same kind of "overlapping" that I see, or whether it means something else:
2023-09-22 15:08:44 INFO:busco.analysis.GenomeAnalysis Validating exons and removing overlapping matches
2023-09-22 15:08:54 INFO:busco.analysis.GenomeAnalysis 197 candidate overlapping regions found