geNomad Creating Symlinks and Deleting geNomad Database Information

Hello,

I am running the following on a cluster environment, invoking a SLURM array for parallel processing of the input assemblies:

mvip MVP_01_run_genomad_checkv -i ${BASE_DIR}/MVP \
-m ${BASE_DIR}/MVP/MMSP_MVP_metadata.txt \
--genomad_relaxed \
--sample_group ${SLURM_ARRAY_TASK_ID} \
--threads ${SLURM_CPUS_PER_TASK}

When executing this module in parallel, I consistently receive errors like the following:

Traceback (most recent call last):
  File "/.conda/envs/mvip_env/bin/genomad", line 10, in <module>
    sys.exit(cli())
  File "/.conda/envs/mvip_env/lib/python3.8/site-packages/rich_click/rich_command.py", line 367, in __call__
    return super().__call__(*args, **kwargs)
  File "/.conda/envs/mvip_env/lib/python3.8/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/.conda/envs/mvip_env/lib/python3.8/site-packages/rich_click/rich_command.py", line 152, in main
    rv = self.invoke(ctx)
  File "/.conda/envs/mvip_env/lib/python3.8/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/.conda/envs/mvip_env/lib/python3.8/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/.conda/envs/mvip_env/lib/python3.8/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/.conda/envs/mvip_env/lib/python3.8/site-packages/click/decorators.py", line 33, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/.conda/envs/mvip_env/lib/python3.8/site-packages/genomad/cli.py", line 1240, in end_to_end
    ctx.invoke(
  File "/.conda/envs/mvip_env/lib/python3.8/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/.conda/envs/mvip_env/lib/python3.8/site-packages/genomad/cli.py", line 441, in annotate
    genomad.annotate.main(
  File "/.conda/envs/mvip_env/lib/python3.8/site-packages/genomad/modules/annotate.py", line 168, in main
    database_obj = database.Database(database_path)
  File "/.conda/envs/mvip_env/lib/python3.8/site-packages/genomad/database.py", line 10, in __init__
    with open(database_directory / "version.txt") as fin:
FileNotFoundError: [Errno 2] No such file or directory: '/scratch/MVP/00_DATABASES/genomad_db/version.txt'
Traceback (most recent call last):
  File "/.conda/envs/mvip_env/bin/mvip", line 10, in <module>
    sys.exit(cli())
  File "/.conda/envs/mvip_env/lib/python3.8/site-packages/mvip/cli.py", line 155, in cli
    args["func"](args)
  File "/.conda/envs/mvip_env/lib/python3.8/site-packages/mvip/modules/MVP_01_run_genomad_checkv.py", line 461, in main
    virus_genomad_file = glob.glob(os.path.join(args["input"], '01_GENOMAD', str(sample_name), f'{sample_name}_Viruses_Genomad_Output/*/*_virus_summary.tsv'))[0]
IndexError: list index out of range
Error: Module 01 failed.

After investigating, I observed that geNomad modifies the contents of the genomad_db/ directory at runtime. Specifically, it creates a series of symlinks prefixed with genomad_mini_db, which point back to the original database files. Meanwhile, the original database files are deleted during the run, presumably by other simultaneous jobs. The symlinks then break, and downstream geNomad steps (e.g., mmseqs2) fail because resources such as version.txt, .lookup, .source, or the taxonomy files are missing.

Example listing of genomad_db/ after a run:

$ ls -lh genomad_db/
lrwxrwxrwx 10 Jul 17 23:47 genomad_mini_db -> genomad_db
lrwxrwxrwx 17 Jul 17 23:47 genomad_mini_db.lookup -> genomad_db.lookup
lrwxrwxrwx 17 Jul 17 23:47 genomad_mini_db.source -> genomad_db.source
lrwxrwxrwx 12 Jul 17 23:47 genomad_mini_db_h -> genomad_db_h
lrwxrwxrwx 19 Jul 17 23:47 genomad_mini_db_h.dbtype -> genomad_db_h.dbtype
lrwxrwxrwx 18 Jul 17 23:47 genomad_mini_db_h.index -> genomad_db_h.index
lrwxrwxrwx 18 Jul 17 23:47 genomad_mini_db_mapping -> genomad_db_mapping
lrwxrwxrwx 19 Jul 17 23:47 genomad_mini_db_taxonomy -> genomad_db_taxonomy

This results in a broken database state and prevents geNomad from completing its analysis, especially in parallel SLURM array jobs where each task likely tries to modify the same shared database directory.
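For anyone trying to reproduce this, the broken state can be checked with a small helper like the one below. It is only a sketch: the function name list_broken_links is my own, and it relies on nothing beyond standard find and test.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of geNomad or MVP): print every symlink
# under a directory whose target no longer exists.
# Usage: list_broken_links /path/to/genomad_db
list_broken_links() {
    local dir="$1"
    # -type l selects symlinks. `test -e` follows the link and fails when
    # the target is missing, so only dangling links are printed.
    find "$dir" -type l ! -exec test -e {} \; -print
}
```

Running it against the shared genomad_db/ directory after a failed array task should print the genomad_mini_db* links if their targets are gone, and nothing if the database is intact.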

Additional Details:

I've confirmed that Module 01 itself does not modify the database, aside from passing its path to geNomad.

I also tried making the database and its contents read-only, but geNomad still created the symlinks, and the issue persisted.

This behavior seems to originate from within geNomad itself, perhaps due to its internal handling of the database.

My Questions:

Has this issue been observed before in MVP or geNomad under parallel workloads?

Is there a recommended workaround for using geNomad in parallel (e.g., per-run database copies)?

Would it be helpful for MVP to implement logic that copies the DB per task to a temporary directory?
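To make the per-task copy idea concrete, here is a minimal sketch of what I have in mind. The function name stage_private_db and the genomad_db_task_<id> directory naming are my own assumptions, not anything MVP or geNomad provides. I also do not know whether MVP has an option to point Module 01 at a different database path, so this only covers the staging step.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of geNomad or MVP): copy a database
# directory to a private per-task location and print the new path.
# Usage: stage_private_db SRC_DB DEST_ROOT TASK_ID
stage_private_db() {
    local src="$1" dest_root="$2" task_id="$3"
    local dest="${dest_root}/genomad_db_task_${task_id}"
    mkdir -p "$dest" || return 1
    # -L dereferences symlinks so the copy holds real files rather than
    # links back into the shared directory. cp fails on a dangling link,
    # which surfaces an already-broken source database early.
    cp -RL "${src}/." "$dest/" || return 1
    printf '%s\n' "$dest"
}
```

In a SLURM array task this could be called as PRIVATE_DB=$(stage_private_db /scratch/MVP/00_DATABASES/genomad_db "${TMPDIR:-/tmp}" "${SLURM_ARRAY_TASK_ID}"), with the copy removed at the end of the task. The cost is one full database copy per task, so node-local scratch space would need to be large enough.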

Thanks in advance for your help!

If further clarification is needed, I will be happy to provide more details!

Edited Jul 18, 2025 by shaconn