Updates to the limited_build_dir feature in development
This MR makes a number of improvements to the still-in-development ff_limit_build_dir configuration. The goal is a Jacamar-driven workflow in which file locks are used to claim a concurrent directory during initial configuration, reducing the duplicate directories found in our current data_dir structure.
In addition to switching to the fcntl call, there have been changes to the configuration:
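To make the idea concrete, here is a minimal sketch of claiming a concurrent directory with a non-blocking fcntl lock. It is written in Python for illustration only (Jacamar itself is Go), and it is not the code in this MR: the helper name `claim_directory`, the `limit` parameter, and the file modes are assumptions, the `.NNN.lock` / `NNN` naming simply mirrors the directory listing shown later in this description, and lock expiration is left out entirely.

```python
import fcntl
import os


def claim_directory(base, limit, mode=0o750):
    """Return (index, fd) for the first lock file this process can lock.

    Hypothetical helper for illustration; returns (None, None) if every
    lock file below `limit` is already held by another process.
    """
    for index in range(limit):
        name = f"{index:03d}"
        lock_path = os.path.join(base, f".{name}.lock")
        fd = os.open(lock_path, os.O_RDWR | os.O_CREAT, 0o600)
        try:
            # Non-blocking exclusive lock, backed by fcntl record locks.
            fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except OSError:
            # Another process holds this lock, so move on to the next index.
            os.close(fd)
            continue
        # The lock is ours: make sure the matching build directory exists.
        os.makedirs(os.path.join(base, name), mode=mode, exist_ok=True)
        return index, fd  # keep fd open; closing it releases the lock
    return None, None
```

Note that fcntl locks belong to the process rather than the file descriptor, so contention only shows up between separate processes; a second call from the same process would succeed on the same file.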
[general]
# FFLimitBuildDir [Feature Flag] enforces a limited structure on the builds_dir by creating
# a user driven process to automatically obtain concurrent directories through file locking.
ff_limit_build_dir = true
# MaxBuildDir indicates how many concurrent build directories can be left on the system. Only observed
# in conjunction with FFLimitBuildDir. (default: 0, only limited by cumulative runner concurrency).
max_build_dir = 0
# UncapBuildDirCleanup removes the default cap on cleanup. By default, cleanup is limited to a
# single builds_dir in every job. This keeps a CI job from becoming "stuck" during cleanup,
# when we lack the ability to directly notify the user of any cleanup actions.
uncap_build_dir_cleanup = false
# FileLockDebug, if enabled, will create a log file that outlines all actions of the
# 'jacamar lock' process occurring in userspace. This should only be used for troubleshooting
# potential errors with the process of generating/claiming file locks, as there is no
# automated cleanup of these files.
file_lock_debug = false
# RevalidateLock, when enabled, attempts to re-validate the lock file(s) associated with the
# limited build directories before reporting. If any error is encountered, a new lock file
# will be pursued. This currently has limited testing and should likely be used in conjunction
# with FileLockDebug during initial deployment.
revalidate_lock = false
When enabled with a data_dir of /ci, you can expect to see a structure like:
$ pwd
/ci/username/builds/project-name_uniqueID
$ ls -a
.000.lock .001.lock .002.lock 000 001 002
The CI_PROJECT_DIR would look like: /ci/username/builds/project-name_uniqueID/001/group/project-name
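As a rough illustration of how that path is composed, the sketch below joins the pieces visible in the example above. The function and parameter names are hypothetical, and the zero-padded three-digit index is inferred from the listing rather than taken from the implementation.

```python
import posixpath


def project_dir(data_dir, username, project_slug, index, group, project):
    """Compose a CI_PROJECT_DIR-style path (hypothetical helper).

    `project_slug` stands for the "project-name_uniqueID" component and
    `index` for the claimed concurrent directory.
    """
    concurrent = f"{index:03d}"  # assumed zero-padded to three digits
    return posixpath.join(
        data_dir, username, "builds", project_slug, concurrent, group, project
    )


print(project_dir("/ci", "username", "project-name_uniqueID", 1,
                  "group", "project-name"))
# prints /ci/username/builds/project-name_uniqueID/001/group/project-name
```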
Debug files
If file_lock_debug is enabled, a new lock_debug directory will be created alongside the .lock files. There, every unique job will create a log of the actions/errors associated with claiming the lock file:
$ cat 2424067.json
{"level":"info","msg":"unable to lock file 0: fcntl syscall error: resource temporarily unavailable","time":"2024-04-25T16:07:49Z"}
{"level":"info","msg":"lock file /ci/username/builds/project-name_uniqueID/.001.lock has not expired","time":"2024-04-25T16:07:50Z"}
{"level":"info","msg":"unable to lock file 2: fcntl syscall error: resource temporarily unavailable","time":"2024-04-25T16:07:50Z"}
{"level":"info","msg":"lock file /ci/username/builds/project-name_uniqueID/.003.lock has not expired","time":"2024-04-25T16:07:57Z"}
{"level":"info","msg":"file claimed with 1714066690 expiration on ci-test-2 host","time":"2024-04-25T16:08:10Z"}
{"level":"info","msg":"identified concurrent target: 004","time":"2024-04-25T16:08:15Z"}
Hopefully this can assist in troubleshooting during testing.
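Since each line in these files is a standalone JSON object, they are easy to scan with a small script. The sketch below is one way to do it, assuming the message wording matches the sample above; the helper names and the regular expression are illustrative and not part of Jacamar.

```python
import json
import re

# Assumed message format, copied from the sample log above.
TARGET_RE = re.compile(r"identified concurrent target: (\d+)")


def read_lock_debug(text):
    """Parse a lock debug log (one JSON object per line) into dicts."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def concurrent_target(entries):
    """Return the claimed concurrent directory, or None if none was logged."""
    for entry in entries:
        match = TARGET_RE.search(entry.get("msg", ""))
        if match:
            return match.group(1)
    return None


def failed_attempts(entries):
    """Return the messages for lock attempts that did not succeed."""
    return [e["msg"] for e in entries if e.get("msg", "").startswith("unable to lock")]
```

For the sample log above, `concurrent_target` would return `"004"` and `failed_attempts` would list the two fcntl errors.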
Testing without Installing
In order to test this without deployment, I recommend using mpirun, srun, or a similar application to simulate multiple nodes/ranks all attempting to claim locks and identify a valid concurrent directory. Locking can also be tested individually with something like:
$ jacamar --no-auth lock -debug /fs/test 123 3600 488
000
- Optionally you can test the revalidate_lock configuration with -revalidate.
You'll want to change /fs/test to the network filesystem you are targeting. The remaining arguments are the job ID (which in this case doesn't need to be unique), the job timeout in seconds, and the permissions. If successful, the stdout from each run will be a unique number.
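If you script these runs, a small helper along these lines can assemble the command and check the results. It is only a sketch: the helper names are hypothetical, the argument order is copied from the example command above, and passing the permissions as a decimal integer is an assumption based on 488 being the decimal form of octal 0750.

```python
def build_lock_command(lock_dir, job_id, timeout_seconds, mode, debug=True):
    """Build the argv for a single 'jacamar lock' test run (hypothetical helper)."""
    argv = ["jacamar", "--no-auth", "lock"]
    if debug:
        argv.append("-debug")
    # Assumption: permissions are passed in decimal, e.g. 0o750 -> "488".
    argv += [lock_dir, str(job_id), str(timeout_seconds), str(mode)]
    return argv


def outputs_unique(outputs):
    """Return True if every run reported a different concurrent directory."""
    cleaned = [out.strip() for out in outputs]
    return len(cleaned) == len(set(cleaned))


print(" ".join(build_lock_command("/fs/test", 123, 3600, 0o750)))
# prints jacamar --no-auth lock -debug /fs/test 123 3600 488
```

The returned list can be handed to `subprocess.run` once per simulated rank, with the captured stdout values then passed to `outputs_unique`.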
Related to #152 (closed)