Jeffrey Frey requested to merge jtfrey/octopus:cube-io-patch into main Mar 06, 2024

Description

A user experienced routine hangs of Octopus parallel TD runs while charge cube files were being written to `output_iter/`. The issue manifested when jobs were scaled up to 10 nodes × 64 ranks per node (640 MPI ranks). The files resided on a Lustre 2.13.55 file system.

While watching the file system, the user noticed that a particular cube file would appear with a size of 1K, grow, then reappear at 1K, with this cycle repeating many times. In the end, Lustre I/O on the node was suspended and the OST holding the cube file was no longer mounted. In most instances a single rank was left behind, hung in a Lustre I/O write; in a few instances a second rank was concurrently attempting to open and truncate the same cube file.

Though this certainly points to a locking issue in Lustre, it also revealed that Octopus writes the cube file multiple times (640 times in these cases), which also slows execution of the code. Tracing through the source showed that the `X(io_function_output)` subroutine in `src/grid/io_function_inc.F90` had no code to limit cube output, whereas the mesh output code already had such limitations in place.
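
The redundant-write arithmetic and the truncate-and-rewrite cycle can be pictured with a small serial Python model. This is a hypothetical sketch, not Octopus code: it assumes each rank opens the cube file in a truncating write mode, and it runs the 640 ranks one after another, so it shows the repeated truncation but not the concurrent access that triggered the Lustre hang. The function name, file name, and payload are illustrative.

```python
import os
import tempfile

# Job geometry taken from the report; everything else here is hypothetical.
NODES = 10
RANKS_PER_NODE = 64
N_RANKS = NODES * RANKS_PER_NODE  # 640 ranks, hence 640 writes of one file


def simulate_unguarded_output(path, n_ranks, payload):
    """Serially mimic every rank writing the same global cube file.

    Returns the number of writes and the file size observed right after
    each open (the moment the previous contents have just been discarded).
    """
    writes = 0
    sizes_after_open = []
    for _rank in range(n_ranks):
        # Mode "w" truncates on open; assumed here to match each rank's open.
        with open(path, "w") as fh:
            sizes_after_open.append(os.path.getsize(path))
            fh.write(payload)
        writes += 1
    return writes, sizes_after_open


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        cube = os.path.join(tmp, "example.cube")  # hypothetical file name
        writes, sizes = simulate_unguarded_output(cube, N_RANKS, "x" * 1024)
        print(f"{writes} writes; truncated on every open: {all(s == 0 for s in sizes)}")
```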

This patch adds code to `X(io_function_output)` that limits cube output to a single rank.
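
A minimal sketch of the kind of guard described here, written in Python rather than the Fortran of `src/grid/io_function_inc.F90`. It assumes rank 0 is the writer and that every rank already holds the full (global) data; `write_global_cube` and `run_all_ranks` are hypothetical names, and the actual condition used by the patch, including what decides "when appropriate", is not reproduced.

```python
import os
import tempfile

ROOT = 0  # hypothetical: the rank chosen to write global (gathered) output


def write_global_cube(path, rank, payload, root=ROOT):
    """Write the global cube file on the root rank only.

    Every rank may hold the gathered data, but only `root` touches the file.
    Returns True if this rank performed the write.
    """
    if rank != root:
        return False  # non-root ranks skip the file entirely
    with open(path, "w") as fh:
        fh.write(payload)
    return True


def run_all_ranks(path, n_ranks, payload):
    """Serially call the guarded writer for each rank; return the write count."""
    return sum(write_global_cube(path, rank, payload) for rank in range(n_ranks))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        cube = os.path.join(tmp, "example.cube")  # hypothetical file name
        print(run_all_ranks(cube, 10 * 64, "x" * 1024))
```

With the guard in place, the 640-rank loop performs one write instead of 640. In a real MPI run, ranks that later read the file would also need to synchronize with the writer (for example through a barrier), which this serial model does not show.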

News snippet

Global cube output is now limited to a single MPI rank rather than being repeated on every rank.

Checklist

I have checked that my code follows the Octopus coding standards
I have added tests for all the new features in this request (no tests necessary: bug fix).

Limit writing of global cube file in `X(io_function_output)` to a single rank when appropriate