geNomad creates symlinks and deletes database files in genomad_db/ during parallel runs
Hello,
I am running the following on a cluster environment, invoking a SLURM array for parallel processing of the input assemblies:
mvip MVP_01_run_genomad_checkv -i ${BASE_DIR}/MVP \
-m ${BASE_DIR}/MVP/MMSP_MVP_metadata.txt \
--genomad_relaxed \
--sample_group ${SLURM_ARRAY_TASK_ID} \
--threads ${SLURM_CPUS_PER_TASK}
When executing this module in parallel, I consistently receive errors like the following:
Traceback (most recent call last):
File "/.conda/envs/mvip_env/bin/genomad", line 10, in <module>
sys.exit(cli())
File "/.conda/envs/mvip_env/lib/python3.8/site-packages/rich_click/rich_command.py", line 367, in __call__
return super().__call__(*args, **kwargs)
File "/.conda/envs/mvip_env/lib/python3.8/site-packages/click/core.py", line 1157, in __call__
return self.main(*args, **kwargs)
File "/.conda/envs/mvip_env/lib/python3.8/site-packages/rich_click/rich_command.py", line 152, in main
rv = self.invoke(ctx)
File "/.conda/envs/mvip_env/lib/python3.8/site-packages/click/core.py", line 1688, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/.conda/envs/mvip_env/lib/python3.8/site-packages/click/core.py", line 1434, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/.conda/envs/mvip_env/lib/python3.8/site-packages/click/core.py", line 783, in invoke
return __callback(*args, **kwargs)
File "/.conda/envs/mvip_env/lib/python3.8/site-packages/click/decorators.py", line 33, in new_func
return f(get_current_context(), *args, **kwargs)
File "/.conda/envs/mvip_env/lib/python3.8/site-packages/genomad/cli.py", line 1240, in end_to_end
ctx.invoke(
File "/.conda/envs/mvip_env/lib/python3.8/site-packages/click/core.py", line 783, in invoke
return __callback(*args, **kwargs)
File "/.conda/envs/mvip_env/lib/python3.8/site-packages/genomad/cli.py", line 441, in annotate
genomad.annotate.main(
File "/.conda/envs/mvip_env/lib/python3.8/site-packages/genomad/modules/annotate.py", line 168, in main
database_obj = database.Database(database_path)
File "/.conda/envs/mvip_env/lib/python3.8/site-packages/genomad/database.py", line 10, in __init__
with open(database_directory / "version.txt") as fin:
FileNotFoundError: [Errno 2] No such file or directory: '/scratch/MVP/00_DATABASES/genomad_db/version.txt'
Traceback (most recent call last):
File "/.conda/envs/mvip_env/bin/mvip", line 10, in <module>
sys.exit(cli())
File "/.conda/envs/mvip_env/lib/python3.8/site-packages/mvip/cli.py", line 155, in cli
args["func"](args)
File "/.conda/envs/mvip_env/lib/python3.8/site-packages/mvip/modules/MVP_01_run_genomad_checkv.py", line 461, in main
virus_genomad_file = glob.glob(os.path.join(args["input"], '01_GENOMAD', str(sample_name), f'{sample_name}_Viruses_Genomad_Output/*/*_virus_summary.tsv'))[0]
IndexError: list index out of range
Error: Module 01 failed.
After investigating, I observed that geNomad modifies the contents of the genomad_db/ directory during runtime. Specifically, it creates a series of symlinks prefixed with genomad_mini_db*, which point back to the original database files. At the same time, the original database files are deleted during the run, presumably by other simultaneous jobs, causing the symlinks to break and downstream geNomad calls (e.g., mmseqs2) to fail due to missing resources like version.txt, .lookup, .source, or taxonomy files.
Example ls -lh genomad_db/ after a run:
$ ls -lh genomad_db/
lrwxrwxrwx 10 Jul 17 23:47 genomad_mini_db -> genomad_db
lrwxrwxrwx 17 Jul 17 23:47 genomad_mini_db.lookup -> genomad_db.lookup
lrwxrwxrwx 17 Jul 17 23:47 genomad_mini_db.source -> genomad_db.source
lrwxrwxrwx 12 Jul 17 23:47 genomad_mini_db_h -> genomad_db_h
lrwxrwxrwx 19 Jul 17 23:47 genomad_mini_db_h.dbtype -> genomad_db_h.dbtype
lrwxrwxrwx 18 Jul 17 23:47 genomad_mini_db_h.index -> genomad_db_h.index
lrwxrwxrwx 18 Jul 17 23:47 genomad_mini_db_mapping -> genomad_db_mapping
lrwxrwxrwx 19 Jul 17 23:47 genomad_mini_db_taxonomy -> genomad_db_taxonomy
This results in a broken database state and prevents geNomad from completing its analysis, especially in parallel SLURM array jobs where each task likely tries to modify the same shared database directory.
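To make the broken state easier to detect before or after a task runs, here is a minimal sketch of a check I would use. The function name check_genomad_db is hypothetical and not part of geNomad or MVP. It only tests for the version.txt file named in the traceback and for symlinks whose targets no longer exist.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of geNomad or MVP): report whether a
# geNomad database directory looks intact. Prints one line per problem
# found and returns non-zero if anything is missing.
check_genomad_db() {
    local db_dir="$1"
    local status=0
    local entry

    # version.txt is the file the traceback reports as missing.
    if [ ! -f "${db_dir}/version.txt" ]; then
        echo "missing: ${db_dir}/version.txt"
        status=1
    fi

    # A dangling symlink is a link (-L) whose target does not exist (! -e).
    for entry in "${db_dir}"/*; do
        if [ -L "${entry}" ] && [ ! -e "${entry}" ]; then
            echo "dangling symlink: ${entry}"
            status=1
        fi
    done

    return "${status}"
}
```

Calling check_genomad_db on the database path at the start of each array task would at least show which task first sees the damaged directory.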
Additional details:
I've confirmed that Module 01 itself does not modify the database, aside from passing its path to geNomad.
I also tried making the database and its contents read-only, but geNomad still created the symlinks, and the issue persisted.
This behavior seems to originate from within geNomad itself, perhaps due to its internal handling of the database.
My questions:
Has this issue been observed before in MVP or geNomad under parallel workloads?
Is there a recommended workaround for using geNomad in parallel (e.g., per-run database copies)?
Would it be helpful for MVP to implement logic that copies the DB per task to a temporary directory?
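To illustrate the per-task copy I have in mind, here is a rough sketch. The function name stage_private_db and the genomad_db_task_<id> naming are my own assumptions, not anything MVP or geNomad provides. I do not know whether mvip can currently be pointed at a database outside the input directory, so this shows only the staging step.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of geNomad or MVP): copy a shared database
# directory into a private per-task directory and print the new path.
stage_private_db() {
    local shared_db="$1"     # shared database directory, ideally a pristine copy
    local scratch_root="$2"  # node-local or per-job scratch space
    local task_id="$3"       # e.g. ${SLURM_ARRAY_TASK_ID}
    local private_db="${scratch_root}/genomad_db_task_${task_id}"

    mkdir -p "${scratch_root}" || return 1
    # Remove any stale copy left behind by an earlier task with the same id.
    rm -rf "${private_db}"
    # -R copies recursively; -P copies symlinks as symlinks instead of following them.
    cp -RP "${shared_db}" "${private_db}" || return 1
    echo "${private_db}"
}

# Example call inside an array task (database location inferred from the traceback path):
# PRIVATE_DB=$(stage_private_db "${BASE_DIR}/MVP/00_DATABASES/genomad_db" \
#     "${TMPDIR:-/tmp}" "${SLURM_ARRAY_TASK_ID}")
```

Each task would then only ever modify its own copy, at the cost of extra disk space and copy time per task.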
Thanks in advance for your help!
If further clarification is needed, I will be happy to provide more details!