Update Cluster Guide, authored by Trisha Sheth
# cluster_chain_stage

active_stage = cluster_chain_stage(ctrl_chain)

Returns the active stage for each chain in ctrl_chain.
# cluster_cleanup

```
ctrl = cluster_cleanup(...)
```

Cleans up (deletes temporary files and jobs) a batch, list of batches, or a batch chain.
# cluster_compile

```
cluster_compile(...)
```

Compiles the cluster job function cluster_job.m.
# cluster_error_mask

```
cluster_error_mask(error_mask)
```

If error_mask is passed in, interprets the error mask and prints out information about each error that occurred. If no input arguments are given, this function sets all the error mask variables in the caller's workspace.
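Based on the description above, both call forms can be sketched as follows (the mask value 96 is an arbitrary illustration; the exact printed output depends on the toolbox):

```
cluster_error_mask          % no arguments: define the error mask variables in the caller's workspace
cluster_error_mask(96)      % interpret mask 96 and print information about each error bit that is set
```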
# cluster_exec_task

```
cluster_exec_task(ctrl,task_ids,run_mode)
```

Executes task(s) in the current Matlab session. This is useful for debugging, or to avoid long queue wait times when only a few jobs have failed and need to be rerun. For example, to run task TASK_ID from batch BATCH_ID:

cluster_exec_task(BATCH_ID, TASK_ID)
# cluster_get_batch

Gets a batch and populates a ctrl structure for that batch.

update_mode should be set to 0 if this is not the process running the batch, because two processes accessing the status files at the same time can cause errors in them.
# cluster_get_batch_list

```
ctrls = cluster_get_batch_list(param)
```

Get a list of all batches. Leave param undefined to use gRadar.cluster.data_location to search for batches.
# cluster_get_chain_list

```
cluster_get_chain_list(param)
```

Prints a list of all chains. Leave param undefined to use gRadar.cluster.data_location to search for chains.
# cluster_hold

```
ctrls = cluster_hold(batch_id,hold_state)
```

Sends a hold command to cluster_run and causes it to enter debug mode.
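Assuming a hold_state of 1 places the hold and 0 releases it (these semantics are an assumption; they are not documented in this section), usage might look like:

```
ctrls = cluster_hold(BATCH_ID,1);  % place a hold on the batch (hold_state semantics assumed)
ctrls = cluster_hold(BATCH_ID,0);  % release the hold
```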
# cluster_job
```
cluster_job
```

Matlab function that runs the job on the cluster. This is the function which is compiled and calls the desired task functions.
# cluster_job.sh
```
cluster_job.sh
```

Bash script used by the torque and slurm cluster types. This is the job which is called by torque or slurm. This bash script calls cluster_job.m (which, when compiled, creates run_cluster_job.sh).
# cluster_load_chain

```
ctrl = cluster_load_chain([],chain_id)
```

Loads a batch chain from a chain file. chain_id is a unique positive integer.
# cluster_new_batch

```
cluster_new_batch
```

Creates a new batch.
This creates a new batch directory and support files in the batch directory. The default parameters are loaded from gRadar.cluster.
# cluster_new_task

```
ctrl = cluster_new_task(ctrl,sparam,dparam,varargin)
```

Creates a new task in the batch specified by ctrl. The dparam_save option can be set to zero to speed up task creation if many tasks will be created. If this option is used, the cluster_save_dparam function must be called after all the tasks are created, or the dynamic parameters will not be saved to disk (and the tasks will therefore not have access to them when they run).
Input arguments:
|Name|Description|
|----|-----------|
|ctrl|batch structure array (may not be a batch ID)|
|sparam|static param structure (only has an effect for the first task that is created for the batch)|
|dparam|dynamic param structure that will be merged with the sparam structure when the task is run|
|varargin|name-value pairs (e.g. 'dparam_save',0 could be passed in as the 4th and 5th arguments to turn off dynamic parameter saving during task creation)|
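The dparam_save workflow described above can be sketched as follows (my_task and its idx field are hypothetical names used only for illustration):

```
% Create many tasks quickly by deferring the dynamic parameter save,
% then write dparam to disk once at the end with cluster_save_dparam.
for task_idx = 1:100
  dparam.argsin{1}.my_task.idx = task_idx;  % hypothetical per-task field
  ctrl = cluster_new_task(ctrl,sparam,dparam,'dparam_save',0);
end
ctrl = cluster_save_dparam(ctrl);
```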
sparam and dparam each have the same fields. Each field only needs to be set in one or the other structure. For example, one could create the task like this:
```
sparam.task_function = @qlook;
sparam.num_args_out = 1;
dparam.cpu_time = 100;
dparam.notes = 'Test';
param = struct('qlook',struct('some_static_field',3));
sparam.argsin{1} = param;
dparam.argsin{1}.qlook.some_dynamic_field = 2;
```
Note that sparam is only written out for the very first task. Any changes to sparam in subsequent tasks will have no effect.
|Field Name|Description|
|----------|-----------|
|task_function|function handle of the job; this function handle tells cluster_job.m what to run|
|argsin|cell vector of input arguments (default is {})|
|num_args_out|number of output arguments to expect (default is 0)|
|notes|optional note to print after successful submission of the job; default is '' (nothing is written out)|
|cpu_time|maximum cpu time of this task in seconds|
|mem|maximum memory usage of this task in bytes (default is 0)|
|file_success|cell array of files that must exist for the task to be considered a success. If a file is a .mat file, then it must have the file_version variable and not be marked for deletion.|

sparam and dparam will be merged when the task runs. The merging works across cell arrays too. Therefore, setting:
```
sparam.argsin{1}.my_task.sfield = 1;
dparam.argsin{1}.my_task.dfield = 2;
```
will cause the first argument to the task (argsin{1}) to be a structure array with the field "my_task", which is itself a structure array with two fields: sfield and dfield. For example, if your task is a function:
```
function success = my_task(param)
```
then inside your function you will have these fields:
```
param.my_task.sfield = 1;
param.my_task.dfield = 2;
```
# cluster_print
```
cluster_print(ctrl,ids,print_mode,ids_type)
```
Prints or gets information about a particular task or set of tasks in a batch specified by ctrl. The type of ID defaults to a task ID (ids_type equal to 0). If ids_type is set to 1, then it looks up by the torque job ID.
**cluster_print(ctrl,task_id,1)**

Prints full information about one task.

**[in,out] = cluster_print(ctrl,task_id,0)**

Returns input and output information for one task.

**cluster_print(ctrl,task_ids,2)**

Prints tables and returns a struct with these fields for each task ID specified:
|field|description|
|-----|-----------|
|task_id|The task ID is the index into the task. It can be from 1 to N where N is the number of tasks in the batch.|
|job_id|The ID of the job. This is the ID used by the cluster interface.|
|job_status|Job status|
|error_flag|Job error state|
|cpu_time_req|CPU time requested in minutes|
|cpu_time|CPU time currently used in minutes|
|memory_req|Memory requested in MB|
|memory|Memory currently being used in MB|
|schedule|Scheduled time to run as a date string|
# cluster_print_chain
```
cluster_print_chain(ctrl_chain)
```
Prints and gets job state, cpu, and memory information about all jobs in a control chain. cluster_print_chain uses the saved chain information unless a chain cell array is passed in. This occasionally causes erroneous reporting. For example, if the chain is saved in the debug state, but is actually running on the cluster, the running and queued task totals will be zero and these jobs will all be marked as pending.
```
Number of tasks: 771, 198/315/255/3 C/R/Q/T, 0 error, 0 retries
```
The task output information stands for: C/R/Q/T = Complete/Running/Queued/Pending.
# cluster_reset

```
ctrl_chain = cluster_reset(ctrl_chain)
```

Resets fields in a control chain so that the control chain may be run again with cluster_run. Usually this is done after some tasks in the control chain failed and another attempt is being made to run them.
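A typical retry sequence, based on the description above, might look like:

```
ctrl_chain = cluster_reset(ctrl_chain);   % clear state so the failed tasks can run again
ctrl_chain = cluster_run(ctrl_chain,0);   % resubmit in non-blocking mode
```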
# cluster_run

```
ctrl_chain = cluster_run(ctrl_chain, cluster_run_mode)
```

Runs a list of chains (each chain is a list of batches that must be run in serial) or a batch. This function has blocking/polling and non-blocking modes. If the ctrl_chain has been run before, but the ctrl_chain cell array does not have the most recent state information in it, use cluster_run_mode 2 (non-blocking) or 3 (blocking) to tell cluster_run to first query the state of every batch before running it. This query is very slow for large batches, so it is much more efficient to keep track of the ctrl_chain with all the state information intact. If you have done this, then use cluster_run_mode 0 (non-blocking) or 1 (blocking) to skip the query step.
A typical example of where cluster_run_mode is required would be if cluster_run …
Typically, if you are just polling the status of your jobs rather than blocking until they are all done, you would run:
```
ctrl_chain = cluster_load_chain(CHAIN_NUMBER); % CHAIN_NUMBER should be your most recent saved version of this chain
ctrl_chain = cluster_run(ctrl_chain,0);
cluster_save_chain(ctrl_chain); % Note the CHAIN_NUMBER since this will be what you load the next time you poll.
```
Then you would repeat these commands each time you want to poll the status of the job chains.
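If instead you want to block until every task in the chain completes, the same sequence can be used with cluster_run_mode 1:

```
ctrl_chain = cluster_load_chain(CHAIN_NUMBER);  % CHAIN_NUMBER from the last save
ctrl_chain = cluster_run(ctrl_chain,1);         % blocking mode: returns when all tasks are done
cluster_save_chain(ctrl_chain);
```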
# cluster_save_chain

```
[chain_fn,chain_id] = cluster_save_chain(ctrl_chain)
```

Saves a batch chain to a chain file. chain_id is a unique positive integer. This function also prints out all the batch files and directories that are required to run the chain. This is useful if the batch files need to be generated on one computer system and then moved to another computer system to be run.
# cluster_save_dparam

```
ctrl = cluster_save_dparam(ctrl);
```

Saves the dparam structure to the dynamic inputs file. This is needed when the dparam_save option is set to 0 when cluster_new_task is called.
# cluster_set_chain

```
ctrl_chain = cluster_set_chain(ctrl_chain,varargin)
```

Sets a parameter in every batch in a control chain. Note that some parameters, notably cluster.type, require additional manual functions to be run if changed with cluster_set_chain. If cluster.type is set to matlab, then "ctrl.cluster.jm = parcluster" may need to be run manually for each batch. If cluster.type is set to torque or slurm, then cluster_compile may need to be run manually if there have been code changes since the last compile.
For example, to change the type to debug mode:
```
ctrl_chain = cluster_set_chain(ctrl_chain,'cluster.type','debug')
```
# cluster_stop

```
cluster_stop(ctrl_chain,mode)
```

Stops/cancels/deletes jobs that are running in a cluster for the batches identified in ctrl_chain. The batches are also put on hold. To specify a batch or batches, the second argument can be set to 'batch' (default is 'chain'). Only useful for the 'torque', 'slurm', and 'matlab' modes of operation.
# cluster_submit_batch

```
ctrl = cluster_submit_batch(fun,block,argsin,num_argsout,cpu_time,mem)
```

Convenience function which creates a batch, adds a task to it, runs the batch, and then cleans up the batch. This can be used as a simple example of how to use the cluster interface.
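Following the argument list above, a one-task batch might be submitted like this (my_fun is a hypothetical user function; the resource values are arbitrary):

```
% Run my_fun(3) on the cluster, block until it finishes, expect one output
% argument, and request 3600 s of cpu time and 4e9 bytes of memory.
ctrl = cluster_submit_batch(@my_fun,true,{3},1,3600,4e9);
```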
# cluster_submit_job

```
ctrl = cluster_submit_job(ctrl,job_tasks,job_cpu,job_mem)
```

Submits a job to the cluster. For the debug type cluster, this function also executes the tasks in the job.
# cluster_update_batch

```
ctrl = cluster_task_status(ctrl,force_success_check)
```

Updates the status of the batch.
Another Matlab session is accessing the state of a batch and the cluster specifi…
If a job is completed, the error flags for the tasks executed in that job will be updated in the ctrl structure depending on the success condition and the contents of the task's out file.
# cluster_update_task

```
ctrl = cluster_update_task(ctrl,task_id)
```

Updates the status of the task.
A task's "error_mask" uses a binary mask to separate different types of errors. The Matlab commands bitor, bitand, and dec2bin are useful for interpreting this mask.
Task error_mask
|Error bit|Error decimal|Description|
|---------|-------------|-----------|
|b0000 0000 0000 0001|1|Output file out_TASKID does not exist|
|b0000 0000 0000 0010|2|Output file out_TASKID does not load properly|
|b0000 0000 0000 0100|4|argsout variable does not exist in output file out_TASKID|
|b0000 0000 0000 1000|8|argsout variable has wrong length|
|b0000 0000 0001 0000|16|errorstruct variable does not exist in output file out_TASKID|
|b0000 0000 0010 0000|32|errorstruct variable contains error|
|b0000 0000 0100 0000|64|success criteria error|
|b0000 0000 1000 0000|128|cluster killed error (torque only)|
|b0000 0001 0000 0000|256|wall time exceeded (torque only). This error may be fixed by adjusting the scripts which estimate the cpu time usage, or by increasing ctrl.cluster.cpu_time_mult.|
|b0000 0010 0000 0000|512|Task success criteria failed to evaluate|
|b0000 0100 0000 0000|1024|Output files that are used to check task success do not exist.|
|b0000 1000 0000 0000|2048|Output files that are used to check task success are corrupt and cannot be read.|
|b0001 0000 0000 0000|4096|Maximum memory exceeded error (job used more memory than was requested). This error can be fixed by adjusting the scripts which estimate the memory usage, increasing ctrl.cluster.mem_mult, or by setting ctrl.cluster.max_mem_mode to 'auto'.|
|b0010 0000 0000 0000|8192|Insufficient space for the Matlab Compiler Runtime (MCR) cache so the job could not be started. The primary and secondary caches are set by environment variables MCR_CACHE_ROOT and MCR_CACHE_ROOT2.|
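The bit operations mentioned above can decompose a mask value by hand; 96 is an arbitrary illustration:

```
error_mask = 96;               % 96 = 64 + 32
dec2bin(error_mask,16)         % shows which bits of the mask are set
bitand(error_mask,64) ~= 0     % true: success criteria error
bitand(error_mask,32) ~= 0     % true: errorstruct variable contains error
```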
# Cluster Settings
User settings used by cluster programs:
|Cluster Setting|Default|Description|
|---------------|-------|-----------|
|cluster.cluster_job_fn|fullfile(gRadar.path,'cluster','cluster_job.sh')|Sets the location of the cluster_job.sh bash program.|
|cluster.cpu_time_mult|1|Multiplication factor that will be applied to the supplied cpu time required for a task.|
|cluster.data_location|gRadar.cluster.data_location|Sets the location of the batch directories.|
|cluster.dbstop_if_error|true|Turns on "dbstop if error" when running tasks and using cluster.type debug.|
|cluster.desired_time_per_job|0|Sets the desired time per job. This controls how tasks will be divided into jobs. Setting to zero means one task per job.|
|cluster.file_check_pause|4|Sets the number of seconds to wait for the output file when it is expected.|
|cluster.file_version|'-v7'|Sets the version of the static (static.mat) and dynamic (dynamic.mat) input files and of the output files (out_TASKID.mat) used by the cluster. NOTE: -v6 does not support unicode characters. -v7 supports unicode characters. -v7.3 is HDF, but is generally not as efficient for this task.|
|cluster.force_compile|false|If true, cluster_compile will always recompile even if no files have changed.|
|cluster.hidden_depend_funs|{}|Cell array of hidden dependency functions needed by cluster_compile.|
|cluster.job_complete_pause|{}|For Matlab compiled jobs, this is the pause in seconds after the job completes. Useful on large file systems/clusters that may have long delay times for output files to show up on all computers.|
|cluster.matlab_mcr_path|matlabroot|Sets the location of the Matlab Compiler Runtime library installation. If you have Matlab installed, this is just your Matlab installation directory. Only used for torque and slurm.|
|cluster.max_jobs_active|1|Sets the maximum number of active (queued or running) jobs.|
|cluster.max_mem_mode|'debug'|String that controls how cluster_update_task.m will behave when the max memory is exceeded. Options are 'debug', which drops the session into debug mode when memory is exceeded, and 'auto', which doubles the memory request and resubmits the job.|
|cluster.max_retries|1|Sets the maximum number of retries per task before it gives up running that task.|
|cluster.max_time_per_job|86400|Sets the maximum time per job in seconds.|
|cluster.mcc|'system'|If set to 'system', mcc is run from the command line using the system() function. This is preferable because the Matlab Compiler license gets released when finished. However, mcc from the command line may not work. If so, set to 'eval' and mcc will be run from within Matlab. The downside is that the license will not be released until this Matlab session ends.|
|cluster.mcr_cache_root|'/tmp'|Sets the location of the Matlab Compiler Runtime temporary folder. Only used for torque and slurm.|
|cluster.mem_mult|1|Multiplication factor that will be applied to the supplied memory required for a task.|
|cluster.mem_to_ppn|[]|If not empty, this causes memory requirements to be converted into processor requirements. This is useful when Torque ignores the memory requirement. For example, with 46 processors per node and 120e9 bytes of memory per node available to the cluster, one would set cluster.mem_to_ppn = 120e9/46. The max_ppn parameter must be set as well.|
|cluster.max_ppn|[]|To be used when mem_to_ppn is set. This should usually be set equal to the number of cores on a processor. The valid range is 1 to the number of cores on a processor. Since there is no way with torque to ensure that a task requesting multiple nodes will have all the nodes on one machine, we are restricted to requesting one node. We cannot request more cores than this node has or the job will never run.|
|cluster.qsub_submit_arguments|'-m n -l nodes=1:ppn=%p,pmem=%m,walltime=%t'|Submit argument string for torque. Note memory and time requirements are inserted with regexprep inside cluster_submit_job.m.|
|cluster.ssh_hostname| |Optional cluster hostname. If empty or undefined, cluster commands are run on the local machine. If specified and not empty, cluster commands will be submitted using "ssh -p %d -o LogLevel=QUIET -t %s@%s COMMAND" where port=cluster.ssh_port, username=cluster.ssh_username, and hostname=cluster.ssh_hostname. This functionality requires that the user's login does not print any text to the screen because this will confuse the cluster software's interpretation of the output (i.e. this remote server implementation is not robust). Commands that print to the screen should have " &>/dev/null" and/or " >/dev/null" added to the end of each of them. Occasionally, the "ssh" command hangs, which causes the matlab process to hang. Run "ps -ef \| grep USERNAME \| grep ssh" several times over the course of one minute from the terminal; if you see the same command show up in the list every time, the command has probably hung and should be killed. To kill the process, run "kill PROCESS_ID". For example, "jpaden 9814 47277 0 01:00 pts/21 00:00:00 ssh -p 22 -o LogLevel=QUIET -t jpaden@karst.uits.iu.edu qstat -u jpaden </dev/null" is killed by "kill 9814".|
|cluster.ssh_hostname || Optional cluster hostname. If empty or undefined, then cluster commands are run on the local machine. If specified and not empty, cluster commands will be submitted using "ssh -p %d -o LogLevel=QUIET -t %s@%s "COMMAND" where the port=cluster.ssh_port, username=cluster.ssh_username and hostname=cluster.ssh_hostname. This functionality requires that the user's login does not print any text to the screen because this will confuse the cluster software's interpretation of the output (i.e. this remote server implementation is not robust). Commands that print to the screen should have " &>/dev/null" and/or " >/dev/null" added to the end of each of them. Occasionally, the "ssh" command hangs which causes the matlab process to hang. Run "ps -ef | grep USERNAME | grep ssh" several times over the course of one minute from the terminal and if you see the same command show up in the list every time, then the command has probably hung and should be killed. To kill the process, run "kill PROCESS_ID". For example, "jpaden 9814 47277 0 01:00 pts/21 00:00:00 ssh -p 22 -o LogLevel=QUIET -t jpaden@karst.uits.iu.edu qstat -u jpaden </dev/null" is killed by "kill 9814".
Also, no password should be required, so it is likely that a key should be setup. Reminder steps: Also, no password should be required, so it is likely that a key should be setup. Reminder steps:
ssh-keygen -t rsa ssh-keygen -t rsa
ssh username@hostname mkdir -p .ssh ssh username@hostname mkdir -p .ssh
cat .ssh/id_rsa.pub | ssh username@hostname 'cat >> .ssh/authorized_keys' cat .ssh/id_rsa.pub ssh username@hostname 'cat >> .ssh/authorized_keys'|
cluster.ssh_port 22 If ssh_hostname is specified, this ssh_port will be used. |cluster.ssh_port| 22 |If ssh_hostname is specified, this ssh_port will be used.|
cluster.ssh_username whoami If ssh_hostname is specified, this ssh_username will be used. The default username is the response from the whoami command on the local machine. |cluster.ssh_username| whoami| If ssh_hostname is specified, this ssh_username will be used. The default username is the response from the whoami command on the local machine.|
cluster.slurm_submit_arguments '-N 1 -n 1 --mem=%m --time=%t' Submit argument string for slurm. Note memory and time requirements are inserted with regexprep inside cluster_submit_job.m. QOS and partition requests can be added here. For example "-N 1 -n 1 --mem=%m --time=%t -p fat --qos=short" would submit to the "fat" partition and set the quality of service to "short". |cluster.slurm_submit_arguments| '-N 1 -n 1 --mem=%m --time=%t' |Submit argument string for slurm. Note memory and time requirements are inserted with regexprep inside cluster_submit_job.m. QOS and partition requests can be added here. For example "-N 1 -n 1 --mem=%m --time=%t -p fat --qos=short" would submit to the "fat" partition and set the quality of service to "short".|
cluster.stat_pause 1 Sets the number of seconds to pause between each status check. |cluster.stat_pause |1| Sets the number of seconds to pause between each status check.|
cluster.stop_on_error 1 If a Matlab error occurs in the execution of a task, then the submission script goes into debug mode in cluster_update_task.m. |cluster.stop_on_error| 1| If a Matlab error occurs in the execution of a task, then the submission script goes into debug mode in cluster_update_task.m.|
cluster.submit_pause 0 Sets the number of seconds to pause between each submission. |cluster.submit_pause| 0| Sets the number of seconds to pause between each submission.|
cluster.type 'debug' Sets the cluster type to run jobs (torque, matlab, slurm, debug). |cluster.type| 'debug' |Sets the cluster type to run jobs (torque, matlab, slurm, debug).|
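As an illustration of how the mem_to_ppn and max_ppn settings interact, here is a sketch in Python (not the toolbox's MATLAB code; the exact rounding and clamping behavior are assumptions):

```python
import math

def mem_to_ppn_request(task_mem_bytes, mem_to_ppn, max_ppn):
    """Sketch of converting a task's memory requirement into a processor
    (ppn) request, per the cluster.mem_to_ppn description above: with 46
    cores and 120e9 bytes per node, mem_to_ppn = 120e9/46, so a task's
    memory maps proportionally onto cores. The ceil and the clamp to
    [1, max_ppn] are illustrative assumptions."""
    ppn = math.ceil(task_mem_bytes / mem_to_ppn)
    return max(1, min(ppn, max_ppn))

mem_per_core = 120e9 / 46  # example values from the table above
print(mem_to_ppn_request(50e9, mem_per_core, 46))   # ~40% of node memory -> 20 cores
print(mem_to_ppn_request(500e9, mem_per_core, 46))  # clamped to one full node -> 46
```

Because only one node can be requested, any memory demand above a full node is clamped to max_ppn rather than spilling onto a second node.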
User settings used by user programs:
|Cluster Setting| Default| Description|
|------------------|-------------|-----------------|
|cluster.rerun_only| false| This is a property used by the user. Typically it means a task will not be created if its output already exists.|
Settings that should not be modified by the user. "BATCH" in the filenames below represents the directory name where the batch files are stored. BATCH is formatted like "batch_BB_tp20f818ed_5a79_4002_b41a_bf91ab030392", where "BB" represents the batch number (a positive integer starting at 1) and "tp20f818ed_5a79_4002_b41a_bf91ab030392" is a random string generated by the Matlab tempname.m command.
|Cluster Setting| Default| Description|
|--------------|---------------|-------------------|
|batch_dir| ctrl.cluster.data_location| Directory where batch files are stored.|
|job_id_fn| BATCH/job_id_file| File with all the job IDs for each task. One line per task, 20 characters. -1 means the task is waiting to be submitted.|
|batch_id| Lowest positive ID that is not used| An integer containing the batch ID for this batch.|
|in_fn_dir| BATCH/in| Inputs for each task: static.mat contains inputs that do not change; dynamic.mat contains inputs that do change, in variables called dparam_TASKID.|
|out_fn_dir| BATCH/out| Outputs and errors for each task: out_TASKID.mat file.|
|stdout_fn_dir| BATCH/stdout| Standard output for each task (slurm and torque only). Stored in stdout_TASKID.txt files; only one task per job (the one with the maximum task ID) will have the file.|
|error_fn_dir| BATCH/error| Standard error for each task (slurm and torque only). Stored in error_TASKID.txt files; only one task per job (the one with the maximum task ID) will have the file.|
|hold_fn| BATCH/hold_fn| Empty file that, if present, causes cluster_run to enter debug mode. Placed by cluster_hold.m.|
|job_id_list| NA| Cached contents of job_id_fn.|
|task_id| NA| Last task ID used.|
|submission_queue| NA| Vector representing the queue of task IDs waiting to be submitted. Index 1 is the front of the queue.|
|job_status| NA| String containing the job status for each task. 'T': waiting to be submitted, 'Q'/'R': active, 'C': complete.|
|error_mask| NA| Vector containing the error status for each task. 0 is no error.|
|retries| NA| Vector containing the number of retries used for each task.|
|active_jobs| NA| Number of active jobs.|
|notes| NA| Cache of the same field stored in sparam/dparam. This field is not always available. The values are stored in a cell array; each cell contains the value of this field for the corresponding task (e.g. the first cell contains the "notes" for task_id 1).|
|cpu_time| NA| Cache of the same field stored in sparam/dparam. This field is not always available. The values are stored in a vector; each element contains the value of this field for the corresponding task (e.g. the first element contains the "cpu_time" for task_id 1).|
|mem| NA| Cache of the same field stored in sparam/dparam. This field is not always available. The values are stored in a vector; each element contains the value of this field for the corresponding task (e.g. the first element contains the "mem" for task_id 1).|
|success| NA| Cache of the same field stored in sparam/dparam. This field is not always available. The values are stored in a cell array; each cell contains the value of this field for the corresponding task (e.g. the first cell contains the "success" for task_id 1).|
|file_success| NA| Cache of the same field stored in sparam/dparam. This field is not always available. The values are stored in a cell array; each cell contains the value of this field for the corresponding task (e.g. the first cell contains the "file_success" for task_id 1).|
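For example, the job_status string described above can be summarized per the 'T'/'Q'/'R'/'C' convention (a Python sketch; the function name is illustrative and not part of the cluster toolbox):

```python
def summarize_job_status(job_status):
    """Count tasks by state in a ctrl.job_status string, where each
    character is one task: 'T' = waiting to be submitted, 'Q' or 'R' =
    active, 'C' = complete (per the table above)."""
    return {
        'waiting': job_status.count('T'),
        'active': sum(job_status.count(c) for c in 'QR'),
        'complete': job_status.count('C'),
    }

print(summarize_job_status('TTQRC'))  # {'waiting': 2, 'active': 2, 'complete': 1}
```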
The settings for each task are stored in the "in" directory within each batch directory. These are stored in the static.mat and dynamic.mat files in the "in" directory. The parameter structure for a given task is the result of calling merge_structs.m on static_param (stored in static.mat) and dparam{task_id} (stored in dynamic.mat). The combined result should contain all the input parameters for the task.
NOTE: static fields that do not change for each task should be stored in "static_param" to save space on disk. It is the job of the script that is creating the tasks to determine where to store each parameter (in sparam or in dparam).
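The merging described above can be sketched as follows (in Python, purely for illustration; the real routine is merge_structs.m in MATLAB, and dicts here stand in for structs):

```python
def merge_params(static_param, dparam_task):
    """Sketch of how a task's final parameters are formed: start from
    the shared static_param and let the task-specific dparam entry
    override field by field (dparam{TASKID} overrides static_param when
    a field is defined in both). Nested structs are not handled here."""
    merged = dict(static_param)
    merged.update(dparam_task or {})
    return merged

static_param = {'taskfunction': 'my_task', 'cpu_time': 60, 'notes': ''}
dparam = {1: {'notes': 'frame 1', 'cpu_time': 120}}
# cpu_time and notes come from dparam{1}; taskfunction stays static
print(merge_params(static_param, dparam[1]))
```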
# Input Files static.mat and dynamic.mat
The input files (in/static.mat and in/dynamic.mat) are created by cluster_new_task. Their contents are set by the sparam and dparam input arguments to cluster_new_task. Usually, the function call looks like this:
```
ctrl = cluster_new_task(ctrl,sparam,dparam,'dparam_save',0); % Creating multiple tasks in a batch
ctrl = cluster_new_task(ctrl,sparam,[]); % Creating a single task in a batch
```
The sparam input argument maps to the in/static.mat "static_param" variable.
The dparam input argument for task TASK_ID maps to the in/dynamic.mat "dparam{TASK_ID}" variable.
The contents of the in/static.mat and in/dynamic.mat files are very similar and will be merged when each task executes. The fields in the in/static.mat input file are:
|Cluster Setting| Default| Description|
|---------------|--------|-----------|
|static_param| NA| Structure with static parameters. These are parameters that do not change from task to task within one batch.|
|static_param.cpu_time| 0| The CPU time in seconds to be requested for this task.|
|static_param.file_success| {}| Used by cluster_update_task.m to determine if a task has completed successfully. Each task's file_success field is a cell array of files; each must contain a file_version field without a 'D' in the file_version to be considered successfully created. If all of the files in the cell array exist and meet this condition, then the task is considered to have run successfully.|
|static_param.file_version| ctrl.cluster.file_version| The version argument to the save function in Matlab. It should be set to '-v7.3' and is set to the value stored in ctrl.cluster.file_version when cluster_new_task is called.|
|static_param.mem| 0| The memory in bytes to be requested for this task.|
|static_param.notes| ''| The notes field is used for debugging and is printed out by cluster_print and other cluster functions to provide information to the operator about the tasks.|
|static_param.num_args_out| 0| The number of output arguments to expect for this task.|
|static_param.success| ''| Success criteria evaluation string. Used by cluster_update_task.m to determine if the task was successful. If the string evaluates to logical true, then the task is successful.|
|static_param.taskfunction| NA| String containing the function to be run.|

The fields in the in/dynamic.mat file are:
|Cluster Setting| Description|
|---------------|------------------|
|dparam| Cell array. Each cell contains a structure with the dynamic parameters for the corresponding task. The fields inside each cell are the same as for the in/static.mat file's static_param structure.|
|dparam{TASKID}| Structure with the same fields as the in/static.mat file's static_param structure. These fields do not need to be defined in both in/dynamic.mat and in/static.mat, but each must be defined in one or the other. If a field is defined in both files, dparam{TASKID} overrides the static parameters.|
|dparam{TASKID}.file_success| See in/static.mat description.|
|dparam{TASKID}.file_version| See in/static.mat description.|
|dparam{TASKID}.mem| See in/static.mat description.|
|dparam{TASKID}.notes| See in/static.mat description.|
|dparam{TASKID}.num_args_out| See in/static.mat description.|
|dparam{TASKID}.success| See in/static.mat description.|
|dparam{TASKID}.taskfunction| See in/static.mat description.|
# Output Files out_TASKID.mat
The output files (out/out_TASKID.mat) are created by cluster_job.m in regular operation and by cluster_exec_task.m in the debug run_mode 1. The fields in the out/out_TASKID.mat output files are:
|Cluster Setting| Description|
|------------------------|----------|
|argsout| Cell array containing all of the output arguments of the task function. The number of output arguments must be specified in the input files' num_args_out field.|
|cpu_time_actual| Double scalar containing the number of seconds it took to execute the task.|
|errorstruct| If no error occurs, this field is empty. If an error is thrown, the Matlab exception is contained in this field.|
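A minimal sketch of checking these output fields (in Python, with a dict standing in for the loaded .mat structure); note that cluster_update_task.m also applies the success and file_success criteria, so this is illustrative only:

```python
def check_task_output(out, num_args_out):
    """Check an out_TASKID.mat-style structure per the table above: the
    task failed if errorstruct is non-empty, and argsout should hold
    exactly num_args_out output arguments. Field names follow the table;
    the function itself is a hypothetical helper, not toolbox code."""
    if out.get('errorstruct'):
        return False
    return len(out.get('argsout', [])) == num_args_out

ok = {'argsout': [1, 2], 'cpu_time_actual': 3.5, 'errorstruct': None}
bad = {'argsout': [], 'cpu_time_actual': 0.1, 'errorstruct': {'message': 'failed'}}
print(check_task_output(ok, 2))   # True
print(check_task_output(bad, 2))  # False
```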
# Torque/PBS
## Common Commands
**Always available**
**qstat**
List all jobs
**qstat -q -u USERNAME and qstat -r -u USERNAME**
List all queued or running jobs owned by USERNAME
**qdel 'all'**
Deletes all jobs that you control (normally just your jobs unless administrator).
**pbsnodes -l**
Prints out node reservations
**Only available on PBS/torque**
**showstart JOBID**
Shows scheduled time for job.
**checkjob JOBID**
Prints out complete information about the job including scheduled time
## Other Commands
**qalter**
Change the hold state of a job
**qdel JOBID**
Deletes a specific job
**qsub**
Submit a job
## Interactive
Interactive mode lets you log in to the cluster node with the environment set up the same way it is when the job actually runs. This is helpful for debugging problems that have to do with running your code on the cluster nodes themselves.
To run interactively, set cluster.interactive to true and then run your batch as you normally would. For each job, the torque_submit_job function will print out the steps to run the job interactively. It is not critical, but to keep the torque submission routines in sync you need to assign the new_job_id as one of these steps. Look for the output from qsub like this:
```
qsub: waiting for job 2466505.m2 to start
... INTERACTIVE SESSION ...
qsub: job 2466505.m2 completed
```
In this case, run "new_job_id = 2466505" in Matlab. m2 is the queue and is not part of the job ID.
When debugging at IU, it is a good idea to add '-q debug' (short jobs) or '-q interactive' (long jobs) to your sched.submit_arguments because your job will get processed much faster than in the default queue.
# Slurm
## Common Commands
**scontrol show job JOBID**
List complete information for JOBID
**sinfo**
List all the compute nodes and their state
**squeue --job JOBID**
Show status information for JOBID
**sstat --format=AveCPU,AvePages,AveRSS,AveVMSize,JobID -j JOBID --allsteps**
List statistics about JOBID
## Other Commands
**sacct -u <username> --format=JobID,JobName,MaxRSS,Elapsed**
List information about all jobs
**sacctmgr list qos**
List information about the cluster, such as the maximum number of jobs that can be active/submitted at once, maximum cores, maximum CPU time, etc.
**sbatch**
Submit a job to the cluster to be run in the background
**scancel JOBID**
Cancel or delete JOBID
**scontrol show jobid -dd <jobid>**
List status information for a currently running job
**scontrol hold job_id**
Hold a job so that it does not run
**squeue -u <username>**
List all jobs
**srun**
Run a job on the cluster in interactive mode. For example: srun -N 1 -n 1 --mem=20000 --time=480 --pty bash
**sstat --format=AveCPU,AvePages,AveRSS,AveVMSize,JobID -j <jobid> --allsteps**
List detailed information about a job
On ollie at AWI some sacct commands do not run, but these commands can be used instead:
**sudo get_my_jobs.sh**
Get information about today's jobs
**sudo get_my_jobs.sh -t**
Get information about running jobs
# Matlab Compiler
The Matlab compiler is only used with the Torque and Slurm cluster interfaces. Also, it is only required when the code changes. The compiled code is cross platform (i.e. it is possible to compile on a Windows computer and then run the compiled code on a Linux computer). Once cluster_job.m is compiled, it can be used without requiring any Matlab license. For the compiler to work, the Matlab Compiler Runtime libraries (freely distributable) must be installed. If Matlab is installed on a machine, there is a copy of the MCR files (for that version of Matlab) already available and the Matlab installation directory can be used for the MCR path, as long as the same version of Matlab was used to compile the files. Inside cluster_new_batch, the matlabroot command is used to set ctrl.cluster.matlab_mcr_path.
If Matlab is not installed on the cluster, or the compiled version is different than the Matlab version installed, then gRadar.cluster.matlab_mcr_path should be set to the correct MCR path. Instructions to install the MCR libraries are available on Matlab's website (Matlab Runtime).
**Currently, Matlab is required to submit and track jobs using Slurm and Torque even if the compiler is not needed.** A feature needs to be added to support secure shell commands that would allow submission on cluster interfaces (i.e. head nodes) without a Matlab license. The task generation commands should also be compiled so that the dependent files (frames, records, gps, raw data, layer data, etc.) only need to be accessible from the cluster. A useful setup would be a generic single-job submission function that can run arbitrary functions through ssh; it would 1. compile the function, 2. copy the compiled function, inputs, and outputs via scp, and 3. call the compiled function via ssh. This would be useful if a user has a personal Matlab license but the cluster interface does not have a license and the personal computer does not have direct access to the cluster's file system. The Matlab Runtime/MCR libraries would still need to be installed on the cluster to run the compiled code.
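The three-step compile/copy/run workflow sketched above could look like the following command construction (a Python sketch; every name, path, and file here is hypothetical, and the run_FN.sh wrapper convention assumes a Linux mcc build):

```python
def remote_exec_commands(fn, run_dir, user, host, mcr_path):
    """Build the shell commands for the hypothetical ssh workflow above:
    1. compile fn with mcc, 2. copy the compiled function and inputs via
    scp, 3. run it over ssh with the MCR path. Illustrative only; this
    helper is not part of the cluster toolbox."""
    remote = f"{user}@{host}"
    return [
        f"mcc -m {fn}.m -d build",                    # 1. compile the function
        f"scp build/{fn} in.mat {remote}:{run_dir}/",  # 2. copy binary and inputs
        f"ssh {remote} 'cd {run_dir} && ./run_{fn}.sh {mcr_path} in.mat'",  # 3. run remotely
    ]

for cmd in remote_exec_commands('cluster_job', '/scratch/run1', 'user',
                                'cluster.example.edu', '/opt/mcr/v96'):
    print(cmd)
```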