parseClustering fails on sequences with `|` in their IDs
20:10:34 05-07:DEBUG:ParseClusters:Identifying unannotated proteins (signified by absence of a gene name)
20:10:34 05-07:INFO:ParseClusters:Parsing /tmp/342463.1.main.q/tmp273x7xf3/clustering/clustered_proteins
Traceback (most recent call last):
File "/path/to/conda/envs/hybran/bin/hybran", line 8, in <module>
sys.exit(main())
^^^^^^
File "/path/to/conda/envs/hybran/lib/python3.11/site-packages/hybran/hybran.py", line 336, in main
run.clustering(all_genomes=list(set(sorted(all_genomes))),
File "/path/to/conda/envs/hybran/lib/python3.11/site-packages/hybran/run.py", line 107, in clustering
parseClustering.parseClustersUpdateGBKs(target_gffs=all_genomes,
File "/path/to/conda/envs/hybran/lib/python3.11/site-packages/hybran/parseClustering.py", line 769, in parseClustersUpdateGBKs
clusters = parse_clustered_proteins(clustered_proteins=clusters,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/path/to/conda/envs/hybran/lib/python3.11/site-packages/hybran/parseClustering.py", line 114, in parse_clustered_proteins
cluster_list.append(gffs[isolate][gene_id])
KeyError: 'H37Rv_KOmetA_0|H37Rv_KOmetA_0.tig00000902|quiver|quiver'
The sequence ID of this genome's contig is `tig00000902|quiver|quiver|quiver` and the sample name is `H37Rv_KOmetA_0`. parseClustering uses `|` as the delimiter between the sample name and the gene's feature ID (which also includes the sample name, but that's not the problem), and it gets confused by the presence of `|` in the sequence ID.
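To illustrate the ambiguity, here is a minimal sketch. It is not hybran's actual parsing code: the entry layout (`sample|feature_id`), the feature ID value, and both helper functions are assumptions made up for the example. It only shows that splitting on every `|` loses information once the sequence ID contains the delimiter, while splitting on the first `|` alone keeps the feature ID whole.

```python
SAMPLE = "H37Rv_KOmetA_0"
# Hypothetical feature ID, built from the sample name and the sequence ID
# in this report. The real feature ID format may differ.
FEATURE_ID = "H37Rv_KOmetA_0.tig00000902|quiver|quiver|quiver"
# Assumed entry layout: sample name, '|', feature ID.
ENTRY = f"{SAMPLE}|{FEATURE_ID}"


def split_every_pipe(entry):
    """Split on every '|' and take the first and last fields.

    Only correct when neither part contains '|'.
    """
    fields = entry.split("|")
    return fields[0], fields[-1]


def split_first_pipe(entry):
    """Split on the first '|' only, so later '|' stay in the feature ID.

    Still assumes the sample name itself never contains '|'.
    """
    sample, sep, feature_id = entry.partition("|")
    if not sep:
        raise ValueError(f"no '|' delimiter in entry: {entry!r}")
    return sample, feature_id


if __name__ == "__main__":
    print(split_every_pipe(ENTRY))  # feature ID is truncated
    print(split_first_pipe(ENTRY))  # feature ID is intact
```

Splitting on the first delimiter only shifts the restriction onto the sample name. Using a delimiter that cannot occur in either part, or keeping the two parts as separate fields, would avoid the problem altogether.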