Integrate best hits and task map generation into run rules
Problems
- double acc column in task map files
- unclear which files/Snakemake pathways need the best_hits and map generation rules
Changes
id2annot database map files
- removed the duplicate acc/ID column
- renamed the headers of dbcan, pfam, tigrfam
- acc/ID is now in column 1 of all maps
- new id2annot.map.gz files in test resources
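
Since the acc/ID now sits in column 1 of every map, a single loader can serve all databases. The following is only a sketch of that idea, not the code in this MR: the function name `load_id2annot`, the `skip_header` flag, and the tab-separated layout are assumptions.

```python
import gzip
from collections import defaultdict


def load_id2annot(path, skip_header=False):
    """Group annotation rows by the acc/ID found in column 1.

    Hypothetical helper: assumes a gzip-compressed, tab-separated map.
    One ID may occur on several rows, so values are lists of rows.
    """
    rows_by_id = defaultdict(list)
    with gzip.open(path, "rt") as handle:
        if skip_header:
            next(handle, None)  # assumption: the header is the first line
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            if not fields[0]:
                continue  # skip blank lines
            rows_by_id[fields[0]].append(fields[1:])
    return dict(rows_by_id)
```

Keeping a list per ID matters for maps such as emapper, where the same ID can carry several annotations.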
Database migration v5 to v6
To ensure that new id2annot files are created, the existing id2annot.map.gz files are deleted.
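
The deletion step could look roughly like the sketch below. It is illustrative only: the function name `remove_id2annot_maps`, the `db_dir` argument, and the glob pattern `*id2annot.map.gz` are assumptions, and the actual migration code may locate the files differently.

```python
from pathlib import Path


def remove_id2annot_maps(db_dir):
    """Delete every file ending in id2annot.map.gz below db_dir.

    Hypothetical helper; returns the removed paths so the migration
    can log which maps will be regenerated on the next run.
    """
    removed = []
    # Collect the matches first so files are not deleted while iterating.
    for path in sorted(Path(db_dir).rglob("*id2annot.map.gz")):
        path.unlink()
        removed.append(path)
    return removed
```

Other database files are left untouched, so only the map generation rules need to rerun.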
Rules
Moved the best_hits and map generation rules into:
- run_hmmer.smk
- run_emapper.smk -> only for emapper_v1
- run_diamond.smk
Task map creation functions are in utils/create_task_map.py.
best_hits creation functions are in the best_hits rules.
Emapper v1
- The og'.'prot or 'NA.'prot terms (originally in the first column, which is now replaced by the prot column) are removed from the emapper id2annot map. Instead, these terms are built in write_func_task_map_for_emapper_v1, because they are required for matching against the best_hits dict.
- added a test for emapper_v1 mapping
Example of id2annot map before:
- 0RT9A.9593.ENSGGOP00000011681 cellular processes and signaling Cytoskeleton 0RT9A neurofilament heavy polypeptide 9593.ENSGGOP00000011681
- 16AWT.9593.ENSGGOP00000011681 cellular processes and signaling Cytoskeleton 16AWT neurofilament heavy polypeptide 9593.ENSGGOP00000011681
- 0IGME.9593.ENSGGOP00000011681 cellular processes and signaling Cytoskeleton 0IGME Neurofilament 9593.ENSGGOP00000011681
- NA.9593.ENSGGOP00000011681 cellular processes and signaling Cytoskeleton 12SMI Neurofilament 9593.ENSGGOP00000011681
...
Example of id2annot map after:
- 9593.ENSGGOP00000011681 cellular processes and signaling Cytoskeleton 0RT9A neurofilament heavy polypeptide
- 9593.ENSGGOP00000011681 cellular processes and signaling Cytoskeleton 16AWT neurofilament heavy polypeptide
- 9593.ENSGGOP00000011681 cellular processes and signaling Cytoskeleton 0IGME Neurofilament
...
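
To illustrate how the removed first-column terms can be rebuilt from the new layout, here is a minimal sketch. It is not the body of write_func_task_map_for_emapper_v1: the helper names `build_term` and `parse_map_line`, the tab separator, the column order (prot, category, subcategory, og, description), and the rule that 'NA' is used when no og is given are all assumptions.

```python
def build_term(prot, og=None):
    """Rebuild the og.prot term; assumption: 'NA' replaces a missing og."""
    return f"{og or 'NA'}.{prot}"


def parse_map_line(line):
    """Split one row of the new emapper map (assumed tab-separated)."""
    prot, category, subcategory, og, description = line.rstrip("\n").split("\t")
    return {
        "prot": prot,
        "category": category,
        "subcategory": subcategory,
        "og": og,
        "description": description,
    }


row = parse_map_line(
    "9593.ENSGGOP00000011681\tcellular processes and signaling\t"
    "Cytoskeleton\t0RT9A\tneurofilament heavy polypeptide\n"
)
term = build_term(row["prot"], row["og"])  # term used to look up best_hits
```

Building the term at task map creation time keeps the stored map free of the redundant first column while still producing the form that the best_hits dict expects.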
Edited by Juliane Schmachtenberg